Proceedings of Machine Learning Research

A new ranking scheme for modern data and its application to two-sample hypothesis testing

Wed, 12 Jul 2023 00:00:00 +0000

Rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing due to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as component-wise, spatial, and depth-based ranks do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of dealing with the ranks of observations, we propose two types of ranks applicable to complex data based on a similarity graph constructed on observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their utilization, both the new ranks are used to construct test statistics for the two-sample hypothesis testing, which converge to the $\chi_2^2$ distribution under the permutation null distribution and some mild conditions of the ranks, enabling an easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.

The Aggregation–Heterogeneity Trade-off in Federated Learning

Wed, 12 Jul 2023 00:00:00 +0000

Conventional wisdom in machine learning holds that the more data you train your model on, the better the model can perform. Accordingly, a plethora of federated learning methods have been developed to aggregate as many local samples as possible. Contrary to this belief, this paper shows that aggregation of more data is not necessarily beneficial in the presence of heterogeneity, and reveals a fundamental trade-off between aggregation and heterogeneity in federated learning. We consider a general family of weighted $M$-estimators that interpolate between FedAvg and the local estimator, in which an aggregation rule is determined by the weights of local samples. We derive an upper bound for the estimation error of the weighted $M$-estimators, which decomposes into a bias term induced by heterogeneity and a variance term influenced by aggregation. A measure of heterogeneity, the federated smoothness $\beta$, is introduced to simplify the general result. As an important consequence, the optimal aggregation rule for each local device is to aggregate only its $\lfloor K^{2\beta/(2\beta+1)}/(n/\sigma^2)^{1/(2\beta+1)}\rfloor\vee 1$ closest neighbors among the $K$ devices, where $n$ is the local sample size and $\sigma^2$ is the noise variance. Moreover, we show that our estimator, termed FedKNN, attains the minimax optimal rate over a certain parameter space characterized by $\beta$. This optimal procedure depends crucially on the neighboring structure among devices in terms of the proximity of local parameters. Finally, we prove that without such prior knowledge no estimator can achieve a convergence rate faster than $O(\sigma^2/n)$ and hence adaptation is impossible.

Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Wed, 12 Jul 2023 00:00:00 +0000

Recently, several studies \citep{zhou2021nearly, zhang2021variance, kim2021improved, zhou2022computationally} have provided variance-dependent regret bounds for linear contextual bandits, which interpolates the regret for the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the \emph{first computationally efficient} algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the \emph{variance} of the noise at the round $k$, $d$ is the dimension of the contexts and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales and a multi-layer structure to stratify the context vectors into different layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transitional probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision making problems.

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Wed, 12 Jul 2023 00:00:00 +0000

We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuingoperations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results.In the online setting, we design an algorithm, $\mathtt{UCB-AVG}$, based on an optimistic variant of variance-reduced Q-learning. We show that $\mathtt{UCB-AVG}$ achieves a regret bound $\otilde(S^5A^2\spn(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of state-action space, and $\mathrm{sp}(h^*)$ the span of the optimal bias function.\footnote{We use the notation $\widetilde{O}(\cdot)$ to suppress constant and logarithmic factors.} Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret \citep{bartlett2009regal}. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of $\mathtt{UCB-AVG}$ to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\otilde\left( {SA\spn^2(h^*)\epsilon^{-2}} + S^2A\spn(h^*)\epsilon^{-1} \right).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\spn(^*)\epsilon^{-2})$ established in \cite{wang2022average}. Existing work mainly focuses on ergodic MDPs and the results typically depend on $\mix,$ the \emph{worst-case} mixing time induced by a policy. We remark that the diameter $D$ and mixing time $\mix$ are both lower bounded by $\spn(h^*)$, and $\mix$ can be arbitrarily large for certain MDPs \citep{wang2022average}.On the technical side, our approach integrates two key ideas: learning an $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction \citep{zhang2020almost} in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of \emph{value-difference} between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.

Improved Discretization Analysis for Underdamped Langevin Monte Carlo

Wed, 12 Jul 2023 00:00:00 +0000

Underdamped Langevin Monte Carlo (ULMC) is an algorithm used to sample from unnormalized densities by leveraging the momentum of a particle moving in a potential well. We provide a novel analysis of ULMC, motivated by two central questions: (1) Can we obtain improved sampling guarantees beyond strong log-concavity? (2) Can we achieve acceleration for sampling?For (1), prior results for ULMC only hold under a log-Sobolev inequality together with a restrictive Hessian smoothness condition. Here, we relax these assumptions by removing the Hessian smoothness condition and by considering distributions satisfying a Poincare inequality. Our analysis achieves the state of art dimension dependence, and is also flexible enough to handle weakly smooth potentials. As a byproduct, we also obtain the first KL divergence guarantees for ULMC without Hessian smoothness under strong log-concavity, which is based on a new result on the log-Sobolev constant along the underdamped Langevin diffusion.For (2), the recent breakthrough of Cao, Lu, and Wang (2020) established the first accelerated result for sampling in continuous time via PDE methods. Our discretization analysis translates their result into an algorithmic guarantee, which indeed enjoys better condition number dependence than prior works on ULMC, although we leave open the question of full acceleration in discrete time.Both (1) and (2) necessitate Renyi discretization bounds, which are more challenging than the typically used Wasserstein coupling arguments. We address this using a flexible discretization analysis based on Girsanov’s theorem that easily extends to more general settings.

Zeroth-order Optimization with Weak Dimension Dependency

Wed, 12 Jul 2023 00:00:00 +0000

Zeroth-order optimization is a fundamental research topic that has been a focus of various learning tasks, such as black-box adversarial attacks, bandits, and reinforcement learning. However, in theory, most complexity results assert a linear dependency on the dimension of optimization variable, which implies paralyzations of zeroth-order algorithms for high-dimensional problems and cannot explain their effectiveness in practice. In this paper, we present a novel zeroth-order optimization theory characterized by complexities that exhibit weak dependencies on dimensionality. The key contribution lies in the introduction of a new factor, denoted as $\effdim_{\alpha} = \sup_{\x\in \mathbb{R}^d} \sum_{i=1}^d \sigma_i^\alpha(\nabla^2 f(\x))$ ($\alpha>0$, $\sigma_i(\cdot)$ is the $i$-th singular value in non-increasing order), which effectively functions as a measure of dimensionality. The algorithms we propose demonstrate significantly reduced complexities when measured in terms of the factor $\effdim_{\alpha}$. Specifically, we first study a well-known zeroth-order algorithm from Nesterov and Spokoiny (2017) on quadratic objectives and show a complexity of $ \mathcal{O}\left(\frac{\effdim_1 }{\sigma_d}\log(1/\epsilon)\right) $ for the strongly convex setting. For linear regression, such a complexity is dimension-free and outperforms the traditional result by a factor of $d$ under common conditions. Furthermore, we introduce novel algorithms that leverages the Heavy-ball mechanism to enhance the optimization process. By incorporating this acceleration scheme, our proposed algorithm exhibits a complexity of $ \mathcal{O}\left(\frac{\effdim_{1/2} }{\sqrt{\sigma_d}}\cdot\log{\frac{L}{\mu}}\cdot\log(1/\epsilon)\right) $. For linear regression, under some mild conditions, it is faster than state-of-the-art algorithms by $\sqrt{d}$. We further expand the scope of the method to encompass generic smooth optimization problems, while incorporating an additional Hessian-smooth condition. By considering this extended framework, our approach becomes applicable to a broader range of optimization scenarios. The resultant algorithms demonstrate remarkable complexities, with dimension-independent dominant terms that surpass existing algorithms by an order in $d$ under appropriate conditions. Our analysis lays the foundation for investigating zeroth-order optimization methods for smooth functions within high-dimensional settings.

On the Lower Bound of Minimizing Polyak-Łojasiewicz functions

Wed, 12 Jul 2023 00:00:00 +0000

Polyak-Łojasiewicz (PL) (Polyak, 1963) condition is a weaker condition than the strong convexity but suffices to ensure a global convergence for the Gradient Descent algorithm. In this paper, we study the lower bound of algorithms using first-order oracles to find an approximate optimal solution. We show that any first-order algorithm requires at least ${\Omega}\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right)$ gradient costs to με find an $\epsilon$-approximate optimal solution for a general $L$-smooth function that has an $\mu$-PL constant. This result demonstrates the optimality of the Gradient Descent algorithm to minimize smooth PL functions in the sense that there exists a “hard” PL function such that no first-order algorithm can be faster than Gradient Descent when ignoring a numerical constant. In contrast, it is well-known that the momentum technique, e.g. Nesterov (2003, chap. 2), can provably accelerate Gradient Descent to ${O}\left(\sqrt{\frac{L}{\hat{\mu}}}\log\frac{1}{\epsilon}\right)$ gradient costs for functions that are $L$-smooth and $\hat{\mu}$-strongly convex. Therefore, our result distinguishes the hardness of minimizing a smooth PL function and a smooth strongly convex function as the complexity of the former cannot be improved by any polynomial order in general.

On a Class of Gibbs Sampling over Networks

Wed, 12 Jul 2023 00:00:00 +0000

We consider the sampling problem from a composite distribution whose potential (negative log density) is $\sum_{i=1}^n f_i(x_i)+\sum_{j=1}^m g_j(y_j)+\sum_{i=1}^n\sum_{j=1}^m\nicefrac{\sigma_{ij}}{2\eta} \Vert x_i-y_j \Vert^2_2$ where each of $x_i$ and $y_j$ is in $\Rd$, $f_1, f_2, \ldots, f_n, g_1, g_2, \ldots, g_m$ are strongly convex functions, and $\{\sigma_{ij}\}$ encodes a network structure. Building on the Gibbs sampling method, we develop an efficient sampling framework for this problem when the network is a bipartite graph. More importantly, we establish a non-asymptotic linear convergence rate for it. This work extends earlier works that involve only a graph with two nodes \cite{lee2021structured}. To the best of our knowledge, our result represents the first non-asymptotic analysis of a Gibbs sampler for structured log-concave distributions over networks.Our framework can be potentially used to sample from the distribution $ \propto \exp[-\sum_{i=1}^n f_i(x)-\sum_{j=1}^m g_j(x)]$ in a distributed manner.

Information-Directed Selection for Top-Two Algorithms

Wed, 12 Jul 2023 00:00:00 +0000

We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably these optimality conditions lead to the extension of top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule dubbed information-directed selection (IDS) that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for $k > 1$, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and considerable improvement compared with algorithms without adaptive selection.

Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

Wed, 12 Jul 2023 00:00:00 +0000

We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$) in which the gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD’s dynamics. Along the way, we prove gradient descent automatically balances student neurons and uses this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.

Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models

Wed, 12 Jul 2023 00:00:00 +0000

Tensor decomposition serves as a powerful primitive in statistics and machine learning, and has numerous applications in problems such as learning latent variable models or mixture of Gaussians. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufﬁcient for the power method to recover one of the tensor components.Here we present a novel analysis of the dynamics of tensor power iteration from random initialization in the overcomplete regime, where the tensor rank is much greater than its dimension. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true component, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters in polynomial time. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path.Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to non-proportional limit and polynomially many iterations of the power method.

Online Learning in Dynamically Changing Environments

Wed, 12 Jul 2023 00:00:00 +0000

We study the problem of online learning and online regret minimization when samples are drawn from a general unknown \emph{non-stationary} process. We introduce the concept of a \emph{dynamic changing process} with cost $K$, where the \emph{conditional} marginals of the process can vary arbitrarily, but that the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (upto $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\vch\log T})$ for the \emph{expected worst case} regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected miss-classification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\vch\log^3 T)$. We extend these results to general \emph{smooth adversary} processes with \emph{unknown} reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the \emph{distribution blind} (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.

Open Problem: Learning sparse linear concepts by priming the features

Wed, 12 Jul 2023 00:00:00 +0000

Sparse linear problems can be learned well with online multiplicative updates. The question is weather there are closed form updates based on the past examples that can sample efficiently learn such sparse linear problems as well?We show experimentally that this can be achieved by applying linear least squares, then “priming” the ith features of the past instances by multiplying them by the ith linear least squares weight, and finally applying linear least squares a second time. However it is an open problem whether such priming methods have provably good regret bounds when applied online?

On Classification-Calibration of Gamma-Phi Losses

Wed, 12 Jul 2023 00:00:00 +0000

Gamma-Phi losses constitute a family of multiclass classification loss functions that generalize the logistic and other common losses, and have found application in the boosting literature. We establish the first general sufficient condition for the classification-calibration (CC) of such losses. To our knowledge, this sufficient condition gives the first family of nonconvex multiclass surrogate losses for which CC has been fully justified. In addition, we show that a previously proposed sufficient condition is in fact not sufficient. This contribution highlights a technical issue that is important in the study of multiclass CC but has been neglected in priorwork.

Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation

Wed, 12 Jul 2023 00:00:00 +0000

A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the \emph{curse of multiagency}, where the description length of the game as well as the complexity of many existing learning algorithms scale \emph{exponentially} in the number of agents. While recent work successfully addresses this challenge under the model of tabular Markov Games, their mechanisms critically rely on the number of states being finite and small, and do not extend to practical scenarios with enormous state spaces where \emph{function approximation} must be used to approximate value functions or policies. This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation. We design a new algorithm—\emph{V-Learning with Policy Replay}, which gives the first \emph{polynomial} sample complexity results for learning approximate Coarse Correlated Equilibria (CCEs) of Markov Games under decentralized linear function approximation. Our algorithm always outputs Markov CCEs, and its sample complexity also significantly improves over the current best result for finding Markov CCEs in the tabular setting. This paper further presents an alternative algorithm—\emph{Decentralized Optimistic Policy Mirror Descent}, which finds policy-class-restricted CCEs using a polynomial number of samples. In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation, such as linear quadratic games and MARL problems with low “marginal” Eluder dimension.

Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions

Wed, 12 Jul 2023 00:00:00 +0000

We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions. The proof is essentially based on a novel auxiliary function $\xi$ that helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad’s update. Leveraging simple proofs, we are able to obtain tighter results than existing results [Faw et al 2002] and extend the analysis to several new and important cases. Specifically, for the over-parameterized regime, we show that AdaGrad needs only $\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure the gradient norm smaller than $\varepsilon$, which matches the rate of SGD and significantly tighter than existing rates $\mathcal{O}(\frac{1}{\varepsilon^4})$ for AdaGrad. We then discard the bounded smoothness assumption, and consider a realistic assumption on smoothness called $(L_0,L_1)$-smooth condition, which allows local smoothness to grow with the gradient norm. Again based on the auxiliary function $\xi$, we prove that AdaGrad succeeds in converging under $(L_0,L_1)$-smooth condition as long as the learning rate is lower than a threshold. Interestingly, we further show that the requirement on learning rate under the $(L_0,L_1)$-smooth condition is necessary via proof by contradiction, in contrast with the case of uniform smoothness conditions where convergence is guaranteed regardless of learning rate choices. Together, our analyses broaden the understanding of AdaGrad and demonstrate the power of the new auxiliary function in the investigations of AdaGrad.

Improved Dynamic Regret for Online Frank-Wolfe

Wed, 12 Jul 2023 00:00:00 +0000

To deal with non-stationary online problems with complex constraints, we investigate the dynamic regret of online Frank-Wolfe (OFW), which is an efficient projection-free algorithm for online convex optimization. It is well-known that in the setting of offline optimization, the smoothness of functions and the strong convexity of functions accompanying specific properties of constraint sets can be utilized to achieve fast convergence rates for the Frank-Wolfe (FW) algorithm. However, for OFW, previous studies only establish a dynamic regret bound of $O(\sqrt{T}(V_T+\sqrt{D_T}+1))$ by utilizing the convexity of problems, where $T$ is the number of rounds, $V_T$ is the function variation, and $D_T$ is the gradient variation. In this paper, we derive improved dynamic regret bounds for OFW by extending the fast convergence rates of FW from offline optimization to online optimization. The key technique for this extension is to set the step size of OFW with a line search rule. In this way, we first show that the dynamic regret bound of OFW can be improved to $O(\sqrt{T(V_T+1)})$ for smooth functions. Second, we achieve a better dynamic regret bound of $O(T^{1/3}(V_T+1)^{2/3})$ when functions are smooth and strongly convex, and the constraint set is strongly convex. Finally, for smooth and strongly convex functions with minimizers in the interior of the constraint set, we demonstrate that the dynamic regret of OFW reduces to $O(V_T+1)$, and can be further strengthened to $O(\min\{P_T^\ast,S_T^\ast,V_T\}+1)$ by performing a constant number of FW iterations per round, where $P_T^\ast$ and $S_T^\ast$ denote the path length and squared path length of minimizers, respectively.

Instance-Optimality in Interactive Decision Making: Toward a Non-Asymptotic Theory

Wed, 12 Jul 2023 00:00:00 +0000

We consider the development of adaptive, instance-dependent algorithms for interactive decision making (bandits, reinforcement learning, and beyond) that, rather than only performing well in the worst case, adapt to favorable properties of real-world instances for improved performance. We aim for instance-optimality, a strong notion of adaptivity which asserts that, on any particular problem instance, the algorithm under consideration outperforms all consistent algorithms. Instance-optimality enjoys a rich asymptotic theory originating from the work of \citet{lai1985asymptotically} and \citet{graves1997asymptotically}, but non-asymptotic guarantees have remained elusive outside of certain special cases. Even for problems as simple as tabular reinforcement learning, existing algorithms do not attain instance-optimal performance until the number of rounds of interaction is doubly exponential in the number of states.In this paper, we take the first step toward developing a non-asymptotic theory of instance-optimal decision making with general function approximation. We introduce a new complexity measure, the Allocation-Estimation Coefficient (AEC), and provide a new algorithm, AE2, which attains non-asymptotic instance-optimal performance at a rate controlled by the AEC. Our results recover the best known guarantees for well-studied problems such as finite-armed and linear bandits and, when specialized to tabular reinforcement learning, attain the first instance-optimal regret bounds with polynomial dependence on all problem parameters, improving over prior work exponentially. We complement these results with lower bounds that show that i) existing notions of statistical complexity are insufficient to derive non-asymptotic guarantees, and ii) under certain technical conditions, boundedness of the Allocation-Estimation Coefficient is necessary to learn an instance-optimal allocation of decisions in finite time.

Linearization Algorithms for Fully Composite Optimization

Wed, 12 Jul 2023 00:00:00 +0000

This paper studies first-order algorithms for solving fully composite optimization problems over convex and compact sets. We leverage the structure of the objective by handling its differentiable and non-differentiable components separately, linearizing only the smooth parts. This provides us with new generalizations of the classical Frank-Wolfe method and the Conditional Gradient Sliding algorithm, that cater to a subclass of non-differentiable problems. Our algorithms rely on a stronger version of the linear minimization oracle, which can be efficiently implemented in several practical applications. We provide the basic version of our method with an affine-invariant analysis and prove global convergence rates for both convex and non-convex objectives. Furthermore, in the convex case, we propose an accelerated method with correspondingly improved complexity. Finally, we provide illustrative experiments to support our theoretical results.

Hardness of Agnostically Learning Halfspaces from Worst-Case Lattice Problems

Wed, 12 Jul 2023 00:00:00 +0000

We show hardness of improperly learning halfspaces in the agnostic model, both in the distribution-independent as well as the distribution-specific setting, based on the assumption that worst-case lattice problems, e.g., approximating shortest vectors within polynomial factors, are hard.In particular, we show that under this assumption there is no efficient algorithm that outputs any binary hypothesis, not necessarily a halfspace, achieving misclassfication error better than $\frac 1 2 - \gamma$ even if the optimal misclassification error is as small is as small as $\delta$.Here, $\gamma$ can be smaller than the inverse of any polynomial in the dimension and $\delta$ as small as $\exp(-\Omega(\log^{1-c}(d)))$, where $0 < c < 1$ is an arbitrary constant and $d$ is the dimension.For the distribution-specific setting, we show that if the marginal distribution is standard Gaussian, for any $\beta > 0$ learning halfspaces up to error $\OPT_\LTF + \varepsilon$ takes time at least $d^{\tilde{\Omega}(1/\varepsilon^{2-\beta})}$ under the same hardness assumptions.Similarly, we show that learning degree-$\ell$ polynomial threshold functions up to error $\OPT_{\PTF_\ell} + \e$ takes time at least $d^{\tilde{\Omega}(\ell^{2-\beta}/\varepsilon^{4-2\beta})}$.$\OPT_\LTF$ and $\OPT_{\PTF_\ell}$ denote the best error achievable by any halfspace or polynomial threshold function, respectively.Our lower bounds qualitively match algorithmic guarantees and (nearly) recover known lower bounds based on non-worst-case assumptions.Previously, such hardness results [D16, DKPZ21] were based on average-case complexity assumptions, specifically, variants of Feige’s random 3SAT hypothesis, or restricted to the statistical query model.Our work gives the first hardness results basing these fundamental learning problems on well-understood worst-case complexity assumption.It is inspired by a sequence of recent works showing hardness of learning well-separated Gaussian mixtures based on worst-case lattice problems.

Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures

Wed, 12 Jul 2023 00:00:00 +0000

We study the problem of learning nonparametric distributions in a finite mixture, and establish tight bounds on the sample complexity for learning the component distributions in such models.Namely, we are given i.i.d. samples from a pdf $f$ where $$f=w_1f_1+w_2f_2, \quad w_1+w_2=1, \quad w_1,w_2>0$$and we are interested in learning each component $f_i$.Without any assumptions on $f_i$, this problem is ill-posed.In order to identify the components $f_i$, we assume that each $f_i$ can be written as a convolution of a Gaussian and a compactly supported density $\nu_i$ with $\text{supp}(\nu_1)\cap \text{supp}(\nu_2)=\emptyset$.Our main result shows that $(\frac{1}{\varepsilon})^{\Omega(\log\log \frac{1}{\varepsilon})}$ samples are required for estimating each $f_i$. The proof relies on a quantitative Tauberian theorem that yields a fast rate of approximation with Gaussians, which may be of independent interest. To show this is tight, we also propose an algorithm that uses $(\frac{1}{\varepsilon})^{O(\log\log \frac{1}{\varepsilon})}$ samples to estimate each $f_i$. Unlike existing approaches to learning latent variable models based on moment-matching and tensor methods, our proof instead involves a delicate analysis of an ill-conditioned linear system via orthogonal functions.Combining these bounds, we conclude that the optimal sample complexity of this problem properly lies in between polynomial and exponential, which is not common in learning theory.

Implicit Balancing and Regularization: Generalization and Convergence Guarantees for Overparameterized Asymmetric Matrix Sensing

Wed, 12 Jul 2023 00:00:00 +0000

Recently, there has been significant progress in understanding the convergence and generalization properties of gradient-based methods for training overparameterized learning models. However, many aspects including the role of small random initialization and how the various parameters of the model are coupled during gradient-based updates to facilitate good generalization, remain largely mysterious. A series of recent papers have begun to study this role for non-convex formulations of symmetric Positive Semi-Definite (PSD) matrix sensing problems which involve reconstructing a low-rank PSD matrix from a few linear measurements. The underlying symmetry/PSDness is crucial to existing convergence and generalization guarantees for this problem. In this paper, we study a general overparameterized low-rank matrix sensing problem where one wishes to reconstruct an asymmetric rectangular low-rank matrix from a few linear measurements. We prove that an overparameterized model trained via factorized gradient descent converges to the low-rank matrix generating the measurements. We show that in this setting, factorized gradient descent enjoys two implicit properties: (1) coupling of the trajectory of gradient descent where the factors are coupled in various ways throughout the gradient update trajectory and (2) an algorithmic regularization property where the iterates show a propensity towards low-rank models despite the overparameterized nature of the factorized model. These two implicit properties in turn allow us to show that the gradient descent trajectory from small random initialization moves towards solutions that are both globally optimal and generalize well.

Tackling Combinatorial Distribution Shift: A Matrix Completion Perspective

Wed, 12 Jul 2023 00:00:00 +0000

Obtaining rigorous statistical guarantees for generalization under distribution shift remains an open and active research area. We study a setting we call \emph{combinatorial distribution shift}, where (a) under the test- and training-distributions, the labels $z$ are determined by pairs of features $(x,y)$, (b) the training distribution has coverage of certain \emph{marginal} distributions over $x$ and $y$ separately, but (c) the test distribution involves examples from a product distribution over $(x,y)$ that is \emph{not} covered by the training distribution. Focusing on the special case where the labels are given by \emph{bilinear embeddings} into a Hilbert space $\mathcal H$: $\mathbb{E}[z \mid x,y ]=⟨f_{\star}(x),g_{\star}(y)\rangle_{\mathcal{H}}$, we aim to extrapolate to a test distribution domain that is {not} covered in training, or \emph{bilinear combinatorial extrapolation}. Our setting generalizes a special case of matrix completion from missing-not-at-random data, for which all existing results require the ground-truth matrices to be either \emph{exactly low-rank}, or to exhibit very sharp spectral cutoffs. In this work, we develop a series of theoretical results that enable bilinear combinatorial extrapolation under \emph{gradual} spectral decay as observed in typical high-dimensional data, including novel algorithms, generalization guarantees, and linear-algebraic results. A key tool is a novel perturbation bound for the rank-$k$ singular value decomposition approximations between two matrices that depends on the \emph{relative} spectral gap rather than the \emph{absolute} spectral gap, a result we think may be of broader independent interest.

Online Reinforcement Learning in Stochastic Continuous-Time Systems

Wed, 12 Jul 2023 00:00:00 +0000

Linear dynamical systems that obey stochastic differential equations are canonical models. While optimal control of known systems has a rich literature, the problem is technically hard under model uncertainty and there are hardly any such result. We initiate study of this problem and aim to learn (and simultaneously deploy) optimal actions for minimizing a quadratic cost function. Indeed, this work is the first that comprehensively addresses the crucial challenge of balancing exploration versus exploitation in continuous-time systems. We present online policies that learn optimal actions fast by carefully randomizing the parameter estimates, and establish their performance guarantees: a regret bound that grows with square-root of time multiplied by the number of parameters. Implementation of the policy for a flight-control task demonstrates its efficacy. Further, we prove sharp stability results for inexact system dynamics and tightly specify the infinitesimal regret caused by sub-optimal actions. To obtain the results, we conduct a novel eigenvalue-sensitivity analysis for matrix perturbation, establish upper-bounds for comparative ratios of stochastic integrals, and introduce the new method of policy differentiation. Our analysis sheds light on fundamental challenges in continuous-time reinforcement learning and suggests a useful cornerstone for similar problems.

Orthogonal Directions Constrained Gradient Method: from non-linear equality constraints to Stiefel manifold

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of minimizing a non-convex function over a smooth manifold M. We propose a novel algorithm, the Orthogonal Directions Constrained Gradient Method (ODCGM), which only requires computing a projection onto a vector space. ODCGM is infeasible but the iterates are constantly pulled towards the manifold, ensuring the convergence of ODCGM towards M. ODCGM is much simpler to implement than the classical methods which require the computation of a retraction. Moreover, we show that ODCGM exhibits the near-optimal oracle complexities O(1/ε^{-2}) and O(1/ε^{-4}) in the deterministic and stochastic cases, respectively. Furthermore, we establish that, under an appropriate choice of the projection metric, our method recovers the landing algorithm of Ablin and Peyré (2022), a recently introduced algorithm for optimization over the Stiefel manifold. As a result, we significantly extend the analysis of Ablin and Peyré (2022), establishingnear-optimal rates both in deterministic and stochastic frameworks. Finally, we perform numerical experiments, which shows the efficiency of ODCGM in a high-dimensional setting.

Generalization Guarantees via Algorithm-dependent Rademacher Complexity

Wed, 12 Jul 2023 00:00:00 +0000

Algorithm- and data-dependent generalization bounds are required to explain the generalization behavior of modern machine learning algorithms. In this context, there exists information theoretic generalization bounds that involve (various forms of) mutual information, as well as bounds based on hypothesis set stability. We propose a conceptually related, but technically distinct complexity measure to control generalization error, which is the empirical Rademacher complexity of an algorithm- and data-dependent hypothesis class. Combining standard properties of Rademacher complexity with the convenient structure of this class, we are able to (i) obtain novel bounds based on the finite fractal dimension, which (a) extend previous fractal dimension-type bounds from continuous to finite hypothesis classes, and (b) avoid a mutual information term that was required in prior work; (ii) we greatly simplify the proof of a recent dimension-independent generalization bound for stochastic gradient descent; and (iii) we easily recover results for VC classes and compression schemes, similar to approaches based on conditional mutual information.

The $k$-Cap Process on Geometric Random Graphs

Wed, 12 Jul 2023 00:00:00 +0000

The $k$-cap (or $k$-winners-take-all) process on a graph works as follows: in each iteration, a subset of $k$ vertices of the graph are identified as winners; the next round winners are the vertices that have the highest total degree from the current winners, with ties broken randomly. This natural process is a simple model of firing activity and inhibition in the brain and has been found to have desirable robustness properties as an activation function. We study its convergence on directed geometric random graphs in any constant dimension, revealing rather surprising behavior, with the support of the current active set converging to lie in a small ball and the active set itself remaining essentially random within that.

Exploring Local Norms in Exp-concave Statistical Learning

Wed, 12 Jul 2023 00:00:00 +0000

We consider the standard problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O ( d/n + 1/n \log( 1 / \delta ) )$ excess risk bound valid for a wide class of bounded exp-concave losses, where $d$ is the dimension of the convex reference set, $n$ is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.

Near-optimal fitting of ellipsoids to random points

Wed, 12 Jul 2023 00:00:00 +0000

Given independent standard Gaussian points $v_1, \ldots, v_n$ in dimension $d$, for what values of $(n, d)$ does there exist with high probability an origin-symmetric ellipsoid that simultaneously passes through all of the points? This basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis. Based on strong numerical evidence, Saunderson, Parrilo, and Willsky [Proc. of Conference on Decision and Control, pp. 6031-6036, 2013] conjectured that the ellipsoid fitting problem transitions from feasible to infeasible as the number of points $n$ increases, with a sharp threshold at $n \sim d^2/4$. We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some $n = d^2/\mathrm{polylog}(d)$. Our proof demonstrates feasibility of the least squares construction of Saunderson et al. using a convenient decomposition of a certain non-standard random matrix and a careful analysis of its Neumann expansion via the theory of graph matrices.

Kernelized Diffusion Maps

Wed, 12 Jul 2023 00:00:00 +0000

Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension $d$. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert spaces method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nyström subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.

Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints

Wed, 12 Jul 2023 00:00:00 +0000

We study simple binary hypothesis testing under local differential privacy (LDP) and communication constraints. Our results are either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. Under both privacy and communication constraints, we develop instance-optimal, computationally efficient algorithms that achieve minimal sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set of two distributions $p$ and $q$, defined as $\mathcal{A} := \{(\mathbf{T} p, \mathbf{T} q) | \mathbf{T} \in \mathcal{C}\}$, where $\mathcal{C}$ is the set of channels characterizing the constraints.

InfoNCE Loss Provably Learns Cluster-Preserving Representations

Wed, 12 Jul 2023 00:00:00 +0000

The goal of contrasting learning is to learn a representation that preserves underlying clusters by keeping samples with similar content, e.g. the “dogness” of a dog, close to each other in the space generated by the representation. A common and successful approach for tackling this unsupervised learning problem is minimizing the InfoNCE loss associated with the training samples, where each sample is associated with their augmentations (positive samples such as rotation, crop) and a batch of negative samples (unrelated samples). To the best of our knowledge, it was unanswered if the representation learned by minimizing the InfoNCE loss preserves the underlying data clusters, as it only promotes learning a representation that is faithful to augmentations, i.e., an image and its augmentations have the same representation. Our main result is to show that the representation learned by InfoNCE with a finite number of negative samples is also consistent with respect to {\em clusters} in the data, under the condition that the augmentation sets within clusters may be non-overlapping but are close and intertwined, relative to the complexity of the learning function class.

Sparse PCA Beyond Covariance Thresholding

Wed, 12 Jul 2023 00:00:00 +0000

In the Wishart model for sparse PCAwe are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + \beta vv^\top})$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign).We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as\[\beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln({2 + td/k^2})}\,.\]Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in Krauthgamer et al. (2015) and analyzed in Deshpande and Montanari (2014))), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln({2 + d/k^2})}$.For large enough constant $t$ our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding.Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$.Our idea is based on the idea of Alon et al. (1998) for reducing the clique size in the planted clique problem.Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise d’Orsi et al. (2022).Their model generalizes both sparse PCA and the planted clique problem.In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from d’Orsi et al. (2022) requires quasi-polynomial time in these settings. As a consequence, we get an algorithm that solves a problem that captures both sparse PCA and planted clique and achieves best known guarantees for both of them.In addition, we show that our techniques work with sparse PCA with adversarial perturbations studied in d’Orsi et al. (2020).This model generalizes not only sparse PCA, but also the sparse planted vector problem.As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees thanthe state of the art in some regimes.

Conference on Learning Theory 2023: Preface

Wed, 12 Jul 2023 00:00:00 +0000

PAC Verification of Statistical Algorithms

Wed, 12 Jul 2023 00:00:00 +0000

Goldwasser et al. (2021) recently proposed the setting of PAC verification, where a hypothesis (machine learning model) that purportedly satisfies the agnostic PAC learning objective is verified using an interactive proof. In this paper we develop this notion further in a number of ways. First, we prove a lower bound of $\Omega(\sqrt{d}/\varepsilon^2)$ i.i.d. samples for PAC verification of hypothesis classes of VC dimension $d$. Second, we present a protocol for PAC verification of unions of intervals over $\mathbb{R}$ that improves upon their proposed protocol for that task, and matches our lower bound’s dependence on $d$. Third, we introduce a natural generalization of their definition to verification of general statistical algorithms, which is applicable to a wider variety of settings beyond agnostic PAC learning. Showcasing our proposed definition, our final result is a protocol for the verification of statistical query algorithms that satisfy a combinatorial constraint on their queries.

Sparsity-aware generalization theory for deep neural networks

Wed, 12 Jul 2023 00:00:00 +0000

Deep artificial neural networks achieve surprising generalization abilities that remain poorly understood. In this paper, we present a new approach to analyzing generalization for deep feed-forward ReLU networks that takes advantage of the degree of sparsity that is achieved in the hidden layer activations. By developing a framework that accounts for this reduced effective model size for each input sample, we are able to show fundamental trade-offs between sparsity and generalization. Importantly, our results make no strong assumptions about the degree of sparsity achieved by the model, and it improves over recent norm-based approaches. We illustrate our results numerically, demonstrating non-vacuous bounds when coupled with data-dependent priors even in over-parametrized settings.

Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality

Wed, 12 Jul 2023 00:00:00 +0000

Langevin diffusions are rapidly convergent under appropriate functional inequality assumptions. Hence, it is natural to expect that with additional smoothness conditions to handle the discretization errors, their discretizations like the Langevin Monte Carlo (LMC) converge in a similar fashion. This research program was initiated by Vempala and Wibisono (2019), who established results under log-Sobolev inequalities. Chewi et al. (2022a) extended the results to handle the case of Poincaré inequalities. In this paper, we go beyond Poincaré inequalities, and push this research program to its limit. We do so by establishing upper and lower bounds for Langevin diffusions and LMC under weak Poincaré inequalities that are satisfied by a large class of densities including polynomially-decaying heavy-tailed densities (i.e., Cauchy-type). Our results explicitly quantify the effect of the initializer on the performance of the LMC algorithm. In particular, we show that as the tail goes from sub-Gaussian, to sub-exponential, and finally to Cauchy-like, the dependency on the initial error goes from being logarithmic, to polynomial, and then finally to being exponential. This three-step phase transition is in particular unavoidable as demonstrated by our lower bounds, clearly defining the boundaries of LMC.

Local Risk Bounds for Statistical Aggregation

Wed, 12 Jul 2023 00:00:00 +0000

In the problem of aggregation, the aim is to combine a given class of base predictors to achieve predictions nearly as accurate as the best one. In this flexible framework, no assumption is made on the structure of the class or the nature of the target. Aggregation has been studied in both sequential and statistical contexts. Despite some important differences between the two problems, the classical results in both cases feature the same global complexity measure. In this paper, we revisit and tighten classical results in the theory of aggregation in the statistical setting by replacing the global complexity with a smaller, local one. Some of our proofs build on the PAC-Bayes localization technique introduced by Catoni. Among other results, we prove localized versions of the classical bound for the exponential weights estimator due to Leung and Barron and deviation-optimal bounds for the Q-aggregation estimator. These bounds improve over the results of Dai, Rigollet and Zhang for fixed design regression and the results of Lecué and Rigollet for random design regression.

Sharp thresholds in inference of planted subgraphs

Wed, 12 Jul 2023 00:00:00 +0000

We connect the study of phase transitions in high-dimensional statistical inference to the study of threshold phenomena in random graphs. A major question in the study of the Erdős–Rényi random graph $G(n,p)$ is to understand the probability, as a function of $p$, that $G(n,p)$ contains a given subgraph $H=H_n$. This was studied for many specific examples of $H$, starting with classical work of Erdős and Rényi (1960). More recent work studies this question for general $H$, both in building a general theory of sharp versus coarse transitions (Friedgut and Bourgain 1999; Hatami, 2012) and in results on the location of the transition (Kahn and Kalai, 2007; Talagrand, 2010; Frankston, Kahn, Narayanan, Park, 2019; Park and Pham, 2022).In inference problems, one often studies the optimal accuracy of inference as a function of the amount of noise. In a variety of sparse recovery problems, an “all-or-nothing (AoN) phenomenon” has been observed: Informally, as the amount of noise is gradually increased, at some critical threshold the inference problem undergoes a sharp jump from near-perfect recovery to near-zero accuracy (Gamarnik and Zadik, 2017; Reeves, Xu, Zadik, 2021). We can regard AoN as the natural inference analogue of the sharp threshold phenomenon in random graphs. In contrast with the general theorydeveloped for sharp thresholds of random graph properties, the AoNphenomenon has only been studied so far in specific inference settings, anda general theory behind its appearance remains elusive.In this paper we study the general problem of inferring a graph $H=H_n$ planted in an Erdős–Rényi random graph, thus naturally connecting the two lines of research mentioned above. We show that questions of AoN are closely connected to first moment thresholds, and to a generalization of the so-called Kahn–Kalai expectation threshold that scans over subgraphs of $H$ of edge density at least $q$. In a variety of settings we characterize AoN, by showing that AoN occurs \emph{if and only if} this “generalized expectation threshold” is roughly constant in $q$. Our proofs combine techniques from random graph theory and Bayesian inference.

List Online Classification

Wed, 12 Jul 2023 00:00:00 +0000

We study multiclass online prediction where the learner can predict using a list of multiple labels (as opposed to just one label in the traditional setting). We characterize learnability in this model using the $b$-ary Littlestone dimension. This dimension is a variation of the classical Littlestone dimension with the difference that binary mistake trees are replaced with $(k+1)$-ary mistake trees, where $k$ is the number of labels in the list. In the agnostic setting, we explore different scenarios depending on whether the comparator class consists of single-labeled or multi-labeled functions and its tradeoff with the size of the lists the algorithm uses. We find that it is possible to achieve negative regret in some cases and provide a complete characterization of when this is possible.As part of our work, we adapt classical algorithms such as Littlestone’s SOA and Rosenblatt’s Perceptron to predict using lists of labels. We also establish combinatorial results for list-learnable classes, including an online version of the Sauer-Shelah-Perles Lemma. We state our results within the framework of pattern classes — a generalization of hypothesis classes which can represent adaptive hypotheses (i.e. functions with memory), and model data-dependent assumptions such as linear classification with margin.

Efficient median of means estimator

Wed, 12 Jul 2023 00:00:00 +0000

The goal of this note is to present a modification of the popular median of means estimator that achieves sub-Gaussian deviation bounds with nearly optimal constants under minimal assumptions on the underlying distribution. We build on a recent work on the topic and prove that desired guarantees can be attained under weaker requirements.

Quasi-Newton Steps for Efficient Online Exp-Concave Optimization

Wed, 12 Jul 2023 00:00:00 +0000

The aim of this paper is to design computationally-efficient and optimal algorithms for the online and stochastic exp-concave optimization settings. Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee a $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set. However, such algorithms perform so-called \emph{generalized projections} whenever their iterates step outside the feasible set. Such generalized projections require $\Omega(d^3)$ arithmetic operations even for simple sets such a Euclidean ball, making the total runtime of ONS of order $d^3 T$ after $T$ rounds, in the worst-case. In this paper, we side-step generalized projections by using a self-concordant barrier as a regularizer to compute the Newton steps. This ensures that the iterates are always within the feasible set without requiring projections. This approach still requires the computation of the inverse of the Hessian of the barrier at every step. However, using stability properties of the Newton iterates, we show that the inverse of the Hessians can be efficiently approximated via Taylor expansions for most rounds, resulting in a $\widetilde O(d^2 T +d^\omega \sqrt{T})$ total computational complexity, where $\omega\in(2,3]$ is the exponent of matrix multiplication. In the stochastic setting, we show that this translates into a $\widetilde O(d^3/\varepsilon)$ computational complexity for finding an $\varepsilon$-optimal point, answering an open question by Koren 2013. We first prove these new results for the simple case where the feasible set is a Euclidean ball. Then, to move to general convex sets, we use a reduction to Online Convex Optimization over the Euclidean ball. Our final algorithm for general convex sets can be viewed as a more computationally-efficient version of ONS.

Accelerated and Sparse Algorithms for Approximate Personalized PageRank and Beyond

Wed, 12 Jul 2023 00:00:00 +0000

It has recently been shown that ISTA, an unaccelerated optimization method, presents sparse updates for the $\ell_1$-regularized undirected personalized PageRank problem (Fountoulakis et al., 2019), leading to cheap iteration complexity and providing the same guarantees as the approximate personalized PageRank algorithm (\appr{}), (Andersen et al., 2016). In this work, we design an accelerated optimization algorithm for this problem that also performs sparse updates, providing an affirmative answer to the COLT 2022 open question of (Fountoulakis et al., 2022). Acceleration provides a reduced dependence on the condition number, while the dependence on the sparsity in our updates differs from the ISTA approach. Further, we design another algorithm by using conjugate directions to achieve an exact solution while exploiting sparsity. Both algorithms lead to faster convergence for certain parameter regimes. Our findings apply beyond PageRank and work for any quadratic objective whose Hessian is a positive-definite $M$-matrix.

Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties

Wed, 12 Jul 2023 00:00:00 +0000

We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov’s accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works \emph{resort to assuming} that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in (Kim and Yang, 2022) about obtaining global general acceleration without iterates assumptively staying in the feasible set.In our solution, we design an accelerated Riemannian inexact proximal point algorithm, which is a result that was unknown even with exact access to the proximal operator, and is of independent interest. For smooth functions, we show we can implement the prox step inexactly with first-order methods in Riemannian balls of certain diameter that is enough for global accelerated optimization.

Detection-Recovery Gap for Planted Dense Cycles

Wed, 12 Jul 2023 00:00:00 +0000

Planted dense cycles are a type of latent structure that appears in many applications, such as small-world networks in social sciences and sequence assembly in computational biology. We consider a model where a dense cycle with expected bandwidth $n \tau$ and edge density $p$ is planted in an Erdős–Rényi graph $G(n,q)$. We characterize the computational thresholds for the associated detection and recovery problems for the class of low-degree polynomial algorithms. In particular, a gap exists between the two thresholds in a certain regime of parameters. For example, if $n^{-3/4} \ll \tau \ll n^{-1/2}$ and $p = C q = \Theta(1)$ for a constant $C>1$, the detection problem is computationally easy while the recovery problem is hard for low-degree algorithms.

Shortest Program Interpolation Learning

Wed, 12 Jul 2023 00:00:00 +0000

We prove that the Minimum Program Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.

Private Covariance Approximation and Eigenvalue-Gap Bounds for Complex Gaussian Perturbations

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,\delta)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and show that the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$ is bounded by roughly $\tilde{O}(\sqrt{kd})$, whenever there is an appropriately large gap between the $k$’th and the $k+1$’th eigenvalues of $M$. This improves on previous work that requires that the gap between every pair of top-$k$ eigenvalues of $M$ is at least $\sqrt{d}$ for a similar bound. Our analysis leverages the fact that the eigenvalues of complex matrix Brownian motion repel more than in the real case, and uses Dyson’s stochastic differential equations governing the evolution of its eigenvalues to show that the eigenvalues of the matrix $M$ perturbed by complex Gaussian noise have large gaps with high probability. Our results contribute to the analysis of low-rank approximations under average-case perturbations and to an understanding of eigenvalue gaps for random matrices, which may be of independent interest.

Learning Hidden Markov Models Using Conditional Samples

Wed, 12 Jul 2023 00:00:00 +0000

This paper is concerned with the computational and statistical complexity of learning the Hidden Markov model (HMM). Although HMMs are some of the most widely used tools in sequential and time series modeling, they are cryptographically hard to learn in the standard setting where one has access to i.i.d. samples of observation sequences. In this paper, we depart from this setup and consider an \emph{interactive access model}, in which the algorithm can query for samples from the \emph{conditional distributions} of the HMMs. We show that interactive access to the HMM enables computationally efficient learning algorithms, thereby bypassing cryptographic hardness.Specifically, we obtain efficient algorithms for learning HMMs in two settings: \begin{enumerate} \item An easier setting where we have query access to the exact conditional probabilities. Here our algorithm runs in polynomial time and makes polynomially many queries to approximate any HMM in total variation distance. \item A harder setting where we can only obtain samples from the conditional distributions. Here the performance of the algorithm depends on a new parameter, called the fidelity of the HMM. We show that this captures cryptographically hard instances and all previously known positive results. \end{enumerate} We also show that these results extend to a broader class of distributions with latent low rank structure. Our algorithms can be viewed as generalizations and robustifications of Angluin’s $L^*$ algorithm for learning deterministic finite automata from membership queries.

Asymptotic confidence sets for random linear programs

Wed, 12 Jul 2023 00:00:00 +0000

Motivated by the statistical analysis of the discrete optimal transport problem, we prove distributional limits for the solutions of linear programs with random constraints.Such limits were first obtained by Klatt, Munk, & Zemel (2022), but their expressions for the limits involve a computationally intractable decomposition of $\mathbb{R}^m$ into a possibly exponential number of convex cones.We give a new expression for the limit in terms of auxiliary linear programs, which can be solved in polynomial time.We also leverage tools from random convex geometry to give distributional limits for the entire set of random optimal solutions, when the optimum is not unique.Finally, we describe a simple, data-driven method to construct asymptotically valid confidence sets in polynomial time.

Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise

Wed, 12 Jul 2023 00:00:00 +0000

In this paper, we consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient’s noise is assumed to have bounded $p$th moment ($p\in(1,2]$). This is motivated by a recent plethora of studies in the machine learning literature, which point out that, in comparison to the standard finite-variance assumption, the heavy-tailed noise regime is more appropriate for modern machine learning tasks such as training neural networks. In the heavy-tailed noise regime, Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Later, Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Later, Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\E_{\Xi\sim\domxi}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the standard finite-variance case ($p =2$), our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$, which is unknown before.

Exponential Hardness of Reinforcement Learning with Linear Function Approximation

Wed, 12 Jul 2023 00:00:00 +0000

A fundamental question in reinforcement learning theory is: suppose the optimal value functions are linear in given features, can we learn them efficiently? This problem’s counterpart in supervised learning, linear regression, can be solved both statistically and computationally efficiently. Therefore, it was quite surprising when a recent work \cite{kane2022computational} showed a computational-statistical gap for linear reinforcement learning: even though there are polynomial sample-complexity algorithms, unless NP = RP, there are no polynomial time algorithms for this setting.In this work, we build on their result to show a computational lower bound, which is exponential in feature dimension and horizon, for linear reinforcement learning under the Randomized Exponential Time Hypothesis. To prove this we build a round-based game where in each round the learner is searching for an unknown vector in a unit hypercube. The rewards in this game are chosen such that if the learner achieves large reward, then the learner’s actions can be used to simulate solving a variant of 3-SAT, where (a) each variable shows up in a bounded number of clauses (b) if an instance has no solutions then it also has no solutions that satisfy more than (1-$\epsilon$)-fraction of clauses. We use standard reductions to show this 3-SAT variant is approximately as hard as 3-SAT. Finally, we also show a lower bound optimized for horizon dependence that almost matches the best known upper bound of $\exp(\sqrt{H})$.

Improved Bounds for Multi-task Learning with Trace Norm Regularization

Wed, 12 Jul 2023 00:00:00 +0000

Compared with learning each task independently, multi-task learning (MTL) is able to learn with few training samples and achieves better prediction performance. Recently, Boursier et al. (2022) study the estimation error bound for MTL with trace norm regularizer and a few observations per task. However, their results rely on three assumptions: 1) The features are isotropic; 2) The task diversity assumption is enforced to the parameters matrix; 3) The number of tasks is larger than the features dimension. Whether it is possible to drop these three assumptions and improve the bounds in Boursier et al. (2022) has remained unknown. This paper provides an affirmative answer to this question. Specifically, we reduce their upper bounds from $\tilde{\mathcal{O}}(\sigma \sqrt{\frac{rd^2/m+rT}{m}} + \sqrt{\frac{rd^2/m+rdT/m}{m}})$ to $\mathcal{O}( \sigma\sqrt{\frac{r+rd/T}{m}} )$ without three assumptions, where $T$ is the number of tasks, $d$ is the dimension of the feature space, $m$ is the number of observations per task, $r$ is the rank of ground truth matrix, $\sigma$ is the standard deviation of the noise random variable. Moreover, we provide minimax lower bounds showing our upper bounds are rateoptimal if $T =\mathcal{O}(d)$.

$\ell_p$-Regression in the Arbitrary Partition Model of Communication

Wed, 12 Jul 2023 00:00:00 +0000

We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model, for $p\in (0,2]$. In this problem, there is a coordinator and $s$ servers. The $i$-th server receives $A^i\in\{-M, -M+1, \ldots, M\}^{n\times d}$ and $b^i\in\{-M, -M+1, \ldots, M\}^n$ and the coordinator would like to find a $(1+\eps)$-approximate solution to $\min_{x\in\R^n} \norm{(\sum_i A^i)x - (\sum_i b^i)}_p$. Here $M \leq \poly(nd)$ for convenience. This model, where the data is additively shared across servers, is commonly referred to as the arbitrary partition model. We obtain significantly improved bounds for this problem. For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tilde{\Theta}(sd^2 + sd/\epsilon)$ bits. For $p \in (1,2)$, we obtain an $\tilde{O}(sd^2/\eps + sd/\poly(\eps))$ upper bound. Notably, for $d$ sufficiently large, our leading order term only depends linearly on $1/\epsilon$ rather than quadratically. We also show communication lower bounds of $\Omega(sd^2 + sd/\eps^2)$ for $p\in (0,1]$ and $\Omega(sd^2 + sd/\eps)$ for $p\in (1,2]$. Our bounds considerably improve previous bounds due to (Woodruff et al. COLT, 2013) and (Vempala et al., SODA, 2020).

Allocating Divisible Resources on Arms with Unknown and Random Rewards

Wed, 12 Jul 2023 00:00:00 +0000

We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. When the order ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.

Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems

Wed, 12 Jul 2023 00:00:00 +0000

Stochastic optimization has found wide applications in minimizing objective functions in machine learning, which motivates a lot of theoretical studies to understand its practical success. Most of existing studies focus on the convergence of optimization errors, while the generalization analysis of stochastic optimization is much lagging behind. This is especially the case for nonconvex and nonsmooth problems often encountered in practice. In this paper, we initialize a systematic stability and generalization analysis of stochastic optimization on nonconvex and nonsmooth problems. We introduce novel algorithmic stability measures and establish their quantitative connection on the gap between population gradients and empirical gradients, which is then further extended to study the gap between the Moreau envelope of the empirical risk and that of the population risk. To our knowledge, these quantitative connection between stability and generalization in terms of either gradients or Moreau envelopes have not been studied in the literature. We introduce a class of sampling-determined algorithms, for which we develop bounds for three stability measures. Finally, we apply these results to derive error bounds for stochastic gradient descent and its adaptive variant, where we show how to achieve an implicit regularization by tuning the step sizes and the number of iterations.

A Lower Bound for Linear and Kernel Regression with Adaptive Covariates

Wed, 12 Jul 2023 00:00:00 +0000

We prove that the continuous time version of the concentration bounds by Abbasi-Yadkori et al. (2011) for adaptive linear regression cannot be improved in general, showing that there can be a significant price for sequential design. This resolves the continuous time version of the COLT open problem by Vakili et al. (2021b) on confidence intervals for kernel regression with sequential designs. Experimental evidence suggests that improved confidence bounds are also not possible in discrete time.

A Second-Order Method for Stochastic Bandit Convex Optimisation

Wed, 12 Jul 2023 00:00:00 +0000

We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most (1 + r/d)[d^1.5 sqrt(n) + d^3] polylog(n, d, r) where n is the horizon, d the dimension and r is the radius of a known ball containing the minimiser of the loss.

Bagging is an Optimal PAC Learner

Wed, 12 Jul 2023 00:00:00 +0000

Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. Finally, the seminal work by Hanneke (2016) gave an algorithm with a provably optimal sample complexity. His algorithm is based on a careful and structured sub-sampling of the training data and then returning a majority vote among hypotheses trained on each of the sub-samples. While being a very exciting theoretical result, it has not had much impact in practice, in part due to inefficiency, since it constructs a polynomial number of sub-samples of the training data, each of linear size.In this work, we prove the surprising result that the practical and classic heuristic \emph{bagging} (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging pre-dates Hanneke’s algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it only requires a logarithmic number of sub-samples to reach optimality.

A Pretty Fast Algorithm for Adaptive Private Mean Estimation

Wed, 12 Jul 2023 00:00:00 +0000

We design an $(\varepsilon, \delta)$-differentially private algorithm to estimate the mean of a $d$-variate distribution, with unknown covariance $\Sigma$, that is adaptive to $\Sigma$. To within polylogarithmic factors, the estimator achieves optimal rates of convergence with respect to the induced Mahalanobis norm $||\cdot||_\Sigma$, takes time $\tilde{O}(n d^2)$ to compute, has near linear sample complexity for sub-Gaussian distributions, allows $\Sigma$ to be degenerate or low rank, and adaptively extends beyond sub-Gaussianity. Prior to this work, other methods required exponential computation time or the superlinear scaling $n = \Omega(d^{3/2})$ to achieve non-trivial error with respect to the norm $||\cdot||_\Sigma$.

Is Planted Coloring Easier than Planted Clique?

Wed, 12 Jul 2023 00:00:00 +0000

We study the computational complexity of two related problems: recovering a planted q-coloring in G(n,1/2), and finding efficiently verifiable witnesses of non-q-colorability (a.k.a. refutations) in G(n,1/2). Our main results show hardness for both these problems in a restricted-but-powerful class of algorithms based on computing low-degree polynomials in the inputs.The problem of recovering a planted q-coloring is equivalent to recovering q disjoint planted cliques that cover all the vertices — a potentially easier variant of the well-studied planted clique problem. Our first result shows that this variant is as hard as the original planted clique problem in the low-degree polynomial model of computation: each clique needs to have size k >> sqrt(n) for efficient recovery to be possible. For the related variant where the cliques cover a (1-epsilon)-fraction of the vertices, we also show hardness by reduction from planted clique.Our second result shows that refuting q-colorability of G(n,1/2) is hard in the low-degree polynomial model when q >> n^{2/3} but easy when q << n^{1/2}, and we leave closing this gap for future work. Our proof is more subtle than similar results for planted clique and involves constructing a non-standard distribution over q-colorable graphs. We note that while related to several prior works, this is the first work that explicitly formulates refutation problems in the low-degree polynomial model.The proofs of our main results involve showing low-degree hardness of hypothesis testing between an appropriately constructed pair of distributions. For refutation, we show completeness of this approach: in the low-degree model, the refutation task is precisely as hard as the hardest associated testing problem, i.e., proving hardness of refutation amounts to finding a "hard" distribution.

Condition-number-independent Convergence Rate of Riemannian Hamiltonian Monte Carlo with Numerical Integrators

Wed, 12 Jul 2023 00:00:00 +0000

We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\R^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of \cite{kook2022sampling}, which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.

Best-of-three-worlds Analysis for Linear Bandits with Follow-the-regularized-leader Algorithm

Wed, 12 Jul 2023 00:00:00 +0000

The linear bandit problem has been studied for many years in both stochastic and adversarial settings. Designing an algorithm that can optimize the environment without knowing the loss type attracts lots of interest. \citet{LeeLWZ021} propose an algorithm that actively detects the loss type and then switches between different algorithms specially designed for specific settings. However, such an approach requires meticulous designs to perform well in all environments. Follow-the-regularized-leader (FTRL) is another type of popular algorithm that can adapt to different environments. This algorithm is of simple design and the regret bounds are shown to be optimal in traditional multi-armed bandit problems compared with the detect-switch type. Designing an FTRL-type algorithm for linear bandits is an important question that has been open for a long time. In this paper, we prove that the FTRL algorithm with a negative entropy regularizer can achieve the best-of-three-world results for the linear bandit problem. Our regret bounds achieve the same or nearly the same order as the previous detect-switch type algorithm but with a much simpler algorithmic design.

U-Calibration: Forecasting for an Unknown Agent

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of evaluating forecasts of binary events whose predictions are consumed by rational agents who take an action in response to a prediction, but whose utility is unknown to the forecaster. We show that optimizing forecasts for a single scoring rule (e.g., the Brier score) cannot guarantee low regret for all possible agents. In contrast, forecasts that are well-calibrated guarantee that all agents incur sublinear regret. However, calibration is not a necessary criterion here (it is possible for miscalibrated forecasts to provide good regret guarantees for all possible agents), and calibrated forecasting procedures have provably worse convergence rates than forecasting procedures targeting a single scoring rule.Motivated by this, we present a new metric for evaluating forecasts that we call U-calibration, equal to the maximal regret of the sequence of forecasts when evaluated under any bounded scoring rule. We show that sublinear U-calibration error is a necessary and sufficient condition for all agents to achieve sublinear regret guarantees. We additionally demonstrate how to compute the U-calibration error efficiently and provide an online algorithm that achieves $O(\sqrt{T})$ U-calibration error (on par with optimal rates for optimizing for a single scoring rule, and bypassing lower bounds for the traditionally calibrated learning procedures). Finally, we discuss generalizations to the multiclass prediction setting.

Semi-Random Sparse Recovery in Nearly-Linear Time

Wed, 12 Jul 2023 00:00:00 +0000

Sparse recovery is one of the most fundamental and well-studied inverse problems.Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter “fast algorithms” have previously been observed to be brittle in various real-world settings.We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a “helpful” semi-random adversary, a framework for testing overfitting to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix containing an unknown subset of rows $\mathbf{G} \in \mathb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact or noisy measurements, $b = \mathbf{A} x^\star$ or $b = \mathbf{A} x^\star + \xi$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP row subset assumption to a natural weighted variant, and show that our method’s guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time.Our approach differs from that of prior fast iterative methods with provable guarantees under semi-random generative models [CG18, LSTZ20], which typically separate the problem of learning the planted instance from the estimation problem, i.e. they attempt to first learn the planted “good” instance (in our case, the matrix $\mathbf{G}$). However, natural conditions on a submatrix which make sparse recovery tractable, such as RIP, are NP-hard to verify and hence first learning a sufficient row reweighting appears challenging. We eschew this approach and design a new iterative method, tailored to the geometry of sparse recovery, which is provably robust to our semi-random model. Our hope is that our approach opens the door to new robust, efficient algorithms for other natural statistical inverse problems.

A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points

Wed, 12 Jul 2023 00:00:00 +0000

We prove that for $c>0$ a sufficiently small universal constant that a random set of $c d^2/\log^4(d)$ independent Gaussian random points in $\R^d$ lie on a common ellipsoid with high probability. This nearly establishes a conjecture of \citet{SaundersonCPW12}, within logarithmic factors.The latter conjecture has attracted significant attention over the past decade, dueto its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.

Learning and Testing Latent-Tree Ising Models Efficiently

Wed, 12 Jul 2023 00:00:00 +0000

We provide time- and sample-efficient algorithms for learning and testing latent-tree Ising models, i.e. Ising models that may only be observed at their leaf nodes. On the learning side, we obtain efficient algorithms for learning a tree-structured Ising model whose leaf node distribution is close in Total Variation Distance, improving on the results of \cite{cryan2001evolutionary}. On the testing side, we provide an efficient algorithm with fewer samples for testing whether two latent-tree Ising models have leaf-node distributions that are close or far in Total Variation distance. We obtain our algorithms by showing novel localization results for the total variation distance between the leaf-node distributions of tree-structured Ising models, in terms of their marginals on pairs of leaves.

Deterministic Nonsmooth Nonconvex Optimization

Wed, 12 Jul 2023 00:00:00 +0000

We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(\delta,\epsilon)$-Goldstein stationary points. Several recent works have presented randomized algorithms that produce such points using $\widetilde{O}(\delta^{-1}\epsilon^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We resolve this open problem, showing that randomization is necessary to obtain a dimension-free rate. In particular, we prove a lower bound of $\Omega(d)$ for any deterministic algorithm. Moreover, we show that unlike smooth or convex optimization, access to function values is required for any deterministic algorithm to halt within any finite time horizon.On the other hand, we prove that if the function is even slightly smooth, then the dimension-free rate of $\widetilde{O}(\delta^{-1}\epsilon^{-3})$ can be obtained by a deterministic algorithm with merely a logarithmic dependence on the smoothness parameter. Motivated by these findings, we turn to study the complexity of deterministically smoothing Lipschitz functions. Though there are well-known efficient black-box randomized smoothings, we start by showing that no such deterministic procedure can smooth functions in a meaningful manner (suitably defined), resolving an open question in the literature. We then bypass this impossibility result for the structured case of ReLU neural networks. To that end, in a practical “white-box” setting in which the optimizer is granted access to the network’s architecture, we propose a simple, dimension-free, deterministic smoothing of ReLU networks that provably preserves $(\delta,\epsilon)$-Goldstein stationary points. Our method applies to a variety of architectures of arbitrary depth, including ResNets and ConvNets.Combined with our algorithm for slightly-smooth functions, this yields the first deterministic, dimension-free algorithm for optimizing ReLU networks, circumventing our lower bound.

Moments, Random Walks, and Limits for Spectrum Approximation

Wed, 12 Jul 2023 00:00:00 +0000

We study lower bounds for the problem of approximating a one dimensional distribution given (noisy) measurements of its moments. We show that there are distributions on $[-1,1]$ that cannot be approximated to accuracy $\epsilon$ in Wasserstein-1 distance even if we know \emph{all} of their moments to multiplicative accuracy $(1\pm2^{-\Omega(1/\epsilon)})$; this result matches an upper bound of Kong and Valiant [Annals of Statistics, 2017]. To obtain our result, we provide a hard instance involving distributions induced by the eigenvalue spectra of carefully constructed graph adjacency matrices. Efficiently approximating such spectra in Wasserstein-1 distance is a well-studied algorithmic problem, and a recent result of Cohen-Steiner et al. [KDD 2018] gives a method based on accurately approximating spectral moments using $2^{O(1/\epsilon)}$ random walks initiated at uniformly random nodes in the graph.As a strengthening of our main result, we show that improving the dependence on $1/\epsilon$ in this result would require a new algorithmic approach. Specifically, no algorithm can compute an $\epsilon$-accurate approximation to the spectrum of a normalized graph adjacency matrix with constant probability, even when given the transcript of $2^{\Omega(1/\epsilon)}$ random walks of length $2^{\Omega(1/\epsilon)}$ started at random nodes.

Online Learning Guided Curvature Approximation: A Quasi-Newton Method with Global Non-Asymptotic Superlinear Convergence

Wed, 12 Jul 2023 00:00:00 +0000

Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. In particular, no current analysis for quasi-Newton methods guarantees global convergence with an explicit superlinear convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and we relate the bounded regret of the online problem to the superlinear convergence of our method.

Entropic characterization of optimal rates for learning Gaussian mixtures

Wed, 12 Jul 2023 00:00:00 +0000

We consider the question of estimating multi-dimensional Gaussian mixtures (GM) with com- pactly supported or subgaussian mixing distributions. Minimax estimation rate for this class (under Hellinger, TV and KL divergences) is a long-standing open question, even for dimension one. In this paper we characterize this rate (in all dimensions) in terms of the metric entropy of the class. Such characterizations originate from seminal works of Le Cam (1973); Birge ́ (1983); Haussler and Opper (1997); Yang and Barron (1999). However, for GMs a key ingredient missing from earlier work (and widely sought-after) is a comparison result showing that the KL and the squared Hellinger distance are within a constant multiple of each other uniformly over the class. Our main technical contribution is in showing this fact, from which we derive entropy characterization for estimation rate under Hellinger and KL. Interestingly, the sequential (online learning) estimation rate is characterized by the global entropy, while the single-step (batch) rate corresponds to local entropy, paralleling a similar recent discovery for the case of Gaussian sequence model in a pair of works Neykov (2022); Mourtada (2023). Additionally, since Hellinger is a proper metric, our comparison shows that GMs under KL satisfy a version of triangle inequality (with a multiplicative constant), implying that proper and improper estimation rates coincide.

Tighter PAC-Bayes Bounds Through Coin-Betting

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of estimating the mean of a sequence of random elements $f(\theta, X_1)$ $, \ldots, $ $f(\theta, X_n)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $\theta$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the \emph{PAC-Bayes} framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve \emph{even tighter} guarantees. Our approach is based on the \emph{coin-betting} framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by \emph{relaxing} it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.

Empirical Bayes via ERM and Rademacher complexities: the Poisson model

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under moderate sample size. Estimators based on the minimum distance and non-parametric maximum likelihood (NPMLE) methods correct these issues, but are computationally expensive with complexity growing exponentially with dimension. Extending the approach of Barbehenn andZhao (2022), in this work we construct monotone estimators based on empirical risk minimization (ERM) that retain similar theoretical guarantees and can be computed much more efficiently. Adapting the idea of offset Rademacher complexity Liang et al. (2015) to the non-standard loss and function class in empirical Bayes, we show that the shape-constrained ERM estimator attains the minimax regret within constant factors in one dimension and within logarithmic factors in multiple dimensions.

Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds

Wed, 12 Jul 2023 00:00:00 +0000

This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. More precisely, it achieves best-of-three-worlds regret bounds, i.e., of ${O}(\sqrt{T \log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption. Note that polynomial factors in the dimensionality are omitted here. At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better. The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence. In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$ as well, where $\sigma^2$ denotes the maximum variance of the feedback loss. The proposed algorithm is based on the \texttt{SCRiBLe} algorithm (Abernethy et al., 2012). By incorporating into this a new technique we call \textit{scaled-up sampling}, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.

Generalization Error Bounds for Noisy, Iterative Algorithms via Maximal Leakage

Wed, 12 Jul 2023 00:00:00 +0000

We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equivalently, the Sibson mutual information of order infinity) metric, as it is simple to analyze, and it implies both bounds on the probability of having a large generalization error and on its expected value. We show that, if the update function (e.g., gradient) is bounded in $L_2$-norm and the additive noise is isotropic Gaussian noise, then one can obtain an upper-bound on maximal leakage in semi-closed form. Furthermore, we demonstrate how the assumptions on the update function affect the optimal (in the sense of minimizing the induced maximal leakage) choice of the noise. Finally, we compute explicit tight upper bounds on the induced maximal leakage for other scenarios of interest.

A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing

Wed, 12 Jul 2023 00:00:00 +0000

We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, regardless of its degeneracy and depending only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, independent of individual eigenvalues of the underlying function, and dimension-independent under a mild assumption. As an application, we apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.

Reaching Kesten-Stigum Threshold in the Stochastic Block Model under Node Corruptions

Wed, 12 Jul 2023 00:00:00 +0000

We study robust community detection in the context of node-corrupted stochastic block model, where an adversary can arbitrarily modify all the edges incident to a fraction of the n vertices. We present the first polynomial-time algorithm that achieves weak recovery at the Kesten-Stigum threshold even in the presence of a small constant fraction of corrupted nodes. Prior to this work, even state-of-the-art robust algorithms were known to break under such node corruption adversaries, when close to the Kesten-Stigum threshold.We further extend our techniques to the $Z_2$ synchronization problem, where our algorithm reaches the optimal recovery threshold in the presence of similar strong adversarial perturbations.The key ingredient of our algorithm is a novel identifiability proof that leverages the push-out effect of the Grothendieck norm of principal submatrices.

Minimizing Dynamic Regret on Geodesic Metric Spaces

Wed, 12 Jul 2023 00:00:00 +0000

In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the body of results that hold in the Euclidean setting can be transplanted into the land of Riemannian manifolds where new challenges (e.g., curvature) come into play. In this paper, we show how to get optimistic regret bound on manifolds with non-positive curvature whenever improper learning is allowed and propose an array of adaptive no-regret algorithms. To the best of our knowledge, this is the first work that considers general dynamic regret and develops “optimistic” online learning algorithms which can be employed on geodesic metric spaces.

The Computational Complexity of Finding Stationary Points in Non-Convex Optimization

Wed, 12 Jul 2023 00:00:00 +0000

Finding approximate stationary points, i.e., points where the gradient is approximately zero, of non-convex but smooth objective functions $f$ over unrestricted $d$-dimensional domains is one of the most fundamental problems in classical non-convex optimization. Nevertheless, the computational and query complexity of this problem are still not well understood when the dimension $d$ of the problem is independent of the approximation error. In this paper, we show the following computational and query complexity results:1.The problem of finding approximate stationary points over unrestricted domains is PLS-complete.2. For $d = 2$, we provide a zero-order algorithm for finding $\varepsilon$-approximate stationary points that requires at most $O(1/\varepsilon)$ value queries to the objective function.3. We show that any algorithm needs at least $\Omega(1/\varepsilon)$ queries to the objective function and/or its gradient to find $\varepsilon$-approximate stationary points when $d=2$. Combined with the above, this characterizes the query complexity of this problem to be $\Theta(1/\varepsilon)$.4. For $d = 2$, we provide a zero-order algorithm for finding $\varepsilon$-KKT points in constrained optimization problems that requires at most $O(1/\sqrt{\varepsilon})$ value queries to the objective function. This closes the gap between the works of Bubeck and Mikulincer (2020) and Vavasis (1993) and characterizes the query complexity of this problem to be $\Theta(1/\sqrt{\varepsilon})$.5. We show that finding approximate KKT points in constrained optimization is reducible to finding approximate stationary points in unconstrained optimization but the converse is impossible.

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

Wed, 12 Jul 2023 00:00:00 +0000

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.

Algorithmically Effective Differentially Private Synthetic Data

Wed, 12 Jul 2023 00:00:00 +0000

We present a highly effective algorithmic approach for generating $\varepsilon$-differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the 1-Wasserstein distance. In particular, for a dataset $\mathcal X$ in the hypercube $[0,1]^d$, our algorithm generates synthetic dataset $\mathcal Y$ such that the expected 1-Wasserstein distance between the empirical measure of $\mathcal X$ and $\mathcal Y$ is $O((\varepsilon n)^{-1/d})$ for $d\geq 2$, and is $O(\log^2(\varepsilon n)(\varepsilon n)^{-1})$ for $d=1$. The accuracy guarantee is optimal up to a constant factor for $d\geq 2$, and up to a logarithmic factor for $d=1$. Our algorithm has a fast running time of $O(\varepsilon d n)$ for all $d\geq 1$ and demonstrates improved accuracy compared to the method in Boedihardjo et al. (2022) for $d\geq 2$.

Optimal Scoring Rules for Multi-dimensional Effort

Wed, 12 Jul 2023 00:00:00 +0000

This paper develops a framework for the design of scoring rules to optimally incentivize an agent to exert a multi-dimensional effort. This framework is a generalization to strategic agents of the classical knapsack problem (cf. Briest, Krysta, and Vocking, 2005; Singer, 2010) and it is foundational to applying algorithmic mechanism design to the classroom. The paper identifies two simple families of scoring rules that guarantee constant approximations to the optimal scoring rule. The truncated separate scoring rule is the sum of single dimensional scoring rules that is truncated to the bounded range of feasible scores. The threshold scoring rule gives the maximum score if reports exceed a threshold and zero otherwise. Approximate optimality of one or the other of these rules is similar to the bundling or selling separately result of Babaioff, Immorlica, Lucier, and Weinberg (2014). Finally, we show that the approximate optimality of the best of those two simple scoring rules is robust when the agent’s choice of effort is made sequentially.

Bandit Learnability can be Undecidable

Wed, 12 Jul 2023 00:00:00 +0000

We initiate a general investigation into structured bandits. Specifically, for an abstract space $X$, we suppose a true reward function $f$ resides in a known, but arbitrary, function class $F$. The algorithm may then pull a number of arms $x$ (i.e., query for the value $f(x)$), and thereby attempts to identify an arm $\hat{x}$ of near-maximum reward: $f(\hat{x}) \geq \sup_x f(x) - \epsilon$. While special cases of this problem are well understood in the literature, our interest is in the possibility of a fully-general theory of bandit learnability, analogous to the PAC model for classification: that is, a theory which precisely characterizes which function classes $F$ admit a learning algorithm guaranteed to identify a near-optimal arm within a bounded number of pulls.Our main result in this regard is an illuminating impossibility result. Namely, there exist well-defined function classes $F$ such that bandit learnability is \emph{undecidable} within ZFC set theory. While such undecidability results have previously been shown for a certain abstractly-defined learning problem known as EMX, this is the first example of a natural or commonly-encountered learning problem (i.e., bandits) for which learnability can be provably undecidable. Our proof is based on establishing a (rather-sophisticated) equivalence between certain subfamilies of EMX learning problems and corresponding constructed bandit problems.Despite this general undecidability result, we also establish new general results in special cases. Specifically, we characterize the optimal query complexity in the special case of binary-valued reward functions in terms of a combinatorial complexity measure related to the teaching dimension. We also present an extension to general bounded real-valued rewards, though in this case the upper bound is not always optimal. We instantiate the new complexity measures for several important families of function classes $F$.

Limits of Model Selection under Transfer Learning

Wed, 12 Jul 2023 00:00:00 +0000

Theoretical studies on \emph{transfer learning} (or \emph{domain adaptation}) have so far focused on situations with a known hypothesis class or \emph{model}; however in practice, some amount of model selection is usually involved, often appearing under the umbrella term or \emph{hyperparameter-tuning}: for example, one may think of the problem of \emph{tuning} for the right neural network architecture towards a target task, while leveraging data from a related \emph{source} task. Now, in addition to the usual tradeoffs on approximation vs estimation errors involved in model selection, this problem brings in a new complexity term, namely, the \emph{transfer distance} between source and target distributions, which is known to vary with the choice of hypothesis class. We present a first study of this problem, focusing on classification; in particular, the analysis reveals some remarkable phenomena: \emph{adaptive rates}, i.e., those achievable with no distributional information, can be arbitrarily slower than \emph{oracle rates}, i.e., when given knowledge on \emph{distances}

Multiclass Online Learning and Uniform Convergence

Wed, 12 Jul 2023 00:00:00 +0000

We study multiclass classification in the agnostic adversarial online learning setting. As our main result, we prove that any multiclass concept class is agnostically learnable if and only if its Littlestone dimension is finite. This solves an open problem studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011,2015) who handled the case when the number of classes (or labels) is bounded. We also prove a separation between online learnability and online uniform convergence by exhibiting an easy-to-learn class whose sequential Rademacher complexity is unbounded.Our learning algorithm uses the multiplicative weights algorithm, with a set of experts defined by executions of the Standard Optimal Algorithm on subsequences of size Littlestone dimension. We argue that the best expert has regret at most Littlestone dimension relative to the best concept in the class. This differs from the well-known covering technique of Ben-David, Pal, and Shalev-Shwartz (2009) for binary classification, where the best expert has regret zero.

Universal Rates for Multiclass Learning

Wed, 12 Jul 2023 00:00:00 +0000

We study universal rates for multiclass classification, establishing the optimal rates (up to log factors) for all hypothesis classes. This generalizes previous results on binary classification (Bousquet, Hanneke, Moran, van Handel, and Yehudayoff, 2021), and resolves an open question studied by Kalavasis, Velegkas, and Karbasi (2022) who handled the multiclass setting with a bounded number of class labels. In contrast, our result applies for any countable label space. Even for finite label space, our proofs provide a more precise bounds on the learning curves, as they do not depend on the number of labels. Specifically, we show that any class admits exponential rates if and only if it has no infinite Littlestone tree, and admits (near-)linear rates if and only if it has no infinite Daniely-Shalev-Shwartz-Littleston (DSL) tree, and otherwise requires arbitrarily slow rates. DSL trees are a new structure we define in this work, in which each node of the tree is given by a pseudo-cube of possible classifications of a given set of points. Pseudo-cubes are a structure, rooted in the work of Daniely and Shalev-Shwartz (2014), and recently shown by Brukhim, Carmon, Dinur, Moran, and Yehudayoff (2022) to characterize PAC learnability (i.e., uniform rates) for multiclass classification. We also resolve an open question of Kalavasis, Velegkas, and Karbasi (2022) regarding the equivalence of classes having infinite Graph-Littlestone (GL) trees versus infinite Natarajan-Littlestone (NL) trees, showing that they are indeed equivalent.

Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

Wed, 12 Jul 2023 00:00:00 +0000

In this paper, we address the stochastic contextual linear bandit problem, where a decision maker is provided a context (a random set of actions drawn from a distribution). The expected reward of each action is specified by the inner product of the action and an unknown parameter. The goal is to design an algorithm that learns to play as close as possible to the unknown optimal policy after a number of action plays. This problem is considered more challenging than the linear bandit problem, which can be viewed as a contextual bandit problem with a \emph{fixed} context. Surprisingly, in this paper, we show that the stochastic contextual problem can be solved as if it is a linear bandit problem. In particular, we establish a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known. When the context distribution is unknown, we establish an algorithm that reduces the stochastic contextual instance to a sequence of linear bandit instances with small misspecifications and achieves nearly the same worst-case regret bound as the algorithm that solves the misspecified linear bandit instances. As a consequence, our results imply a $O(d\sqrt{T\log T})$ high-probability regret bound for contextual linear bandits, making progress in resolving an open problem in Li et al., 2019b, 2021. Our reduction framework opens up a new way to approach stochastic contextual linear bandit problems, and enables improved regret bounds in a number of instances including the batch setting, contextual bandits with misspecifications, contextual bandits with sparse unknown parameters, and contextual bandits with adversarial corruption.

Finite-Sample Symmetric Mean Estimation with Fisher Information Rate

Wed, 12 Jul 2023 00:00:00 +0000

The mean of an unknown variance-$\sigma^2$ distribution $f$ can be estimated from $n$ samples with variance $\frac{\sigma^2}{n}$ and nearly corresponding subgaussian rate. When $f$ is known up to translation, this can be improved asymptotically to $\frac{1}{nI}$, where $I$ is the Fisher information of the distribution. Such an improvement is not possible for general unknown $f$, but [Stone 1975] showed that this asymptotic convergence \emph{is} possible if $f$ is _symmetric_ about its mean. Stone’s bound is asymptotic, however: the $n$ required for convergence depends in an unspecified way on the distribution $f$ and failure probability $\delta$. In this paper we give finite-sample guarantees for symmetric mean estimation in terms of Fisher information. For every $f, n, \delta$ with $n > \log \frac{1}{\delta}$, we get convergence close to a subgaussian with variance $\frac{1}{n I_r}$, where $I_r$ is the $r$-smoothed Fisher information with smoothing radius $r$ that decays polynomially in $n$. Such a bound essentially matches the finite-sample guarantees in the known-$f$ setting.

Online Nonconvex Optimization with Limited Instantaneous Oracle Feedback

Wed, 12 Jul 2023 00:00:00 +0000

We investigate online nonconvex optimization from a local regret minimization perspective. Previous studies along this line implicitly required the access to sufficient gradient oracles at each time instance in order to design double-loop algorithms. In this work, we focus on more challenging but practical settings where only limited number of oracles are available in online nonconvex optimization, including window-smoothed single gradient oracle (Window-SGO), single function value oracle (Window-SVO) and multiple function value oracles (Window-MVO). Specifically, in the Window-SGO setting which allows only single-loop algorithm design, we derive a local regret lower bound, which indicates that single-loop algorithms are provably worse than double-loop algorithms. Further, the simple classical OGD algorithm achieves the window-unconditioned lower bound. Moreover, in the Window-SVO setting, we propose a novel single-loop online algorithm named SkipOGD, and show that it achieves a near-optimal local regret that matches the Window-SGO regret lower bound up to a factor of the dimension $d$ due to the function value feedback. Lastly, in the Window-MVO setting, we propose a new double-loop online algorithm named LoopOGD and show that it achieves a smooth trade-off between regret minimization and sample complexity over the number of oracle calls $K$ per time instance. In particular, with $K=1$ and $wd$, LoopOGD respectively achieves our regret lower bound with Window-SGO (up to the factor $d$ due to function value feedback) and the existing regret lower bound with multiple gradient oracle feedback.

Weak Recovery Threshold for the Hypergraph Stochastic Block Model

Wed, 12 Jul 2023 00:00:00 +0000

We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In HSBM a random graph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label, and weak recovery asks to recover a non-trivial fraction of the labels. We introduce a multi-terminal version of strong data processing inequalities (SDPIs), which we call the multi-terminal SDPI, and use it to prove a variety of impossibility results for weak recovery. In particular, we prove that weak recovery is impossible below the Kesten-Stigum (KS) threshold if $r=3,4$, or a strength parameter $\lambda$ is at least $\frac 15$. Prior work Pal and Zhu (2021) established that weak recovery in HSBM is always possible above the KS threshold. Consequently, there is no information-computation gap for these cases, which (partially) resolves a conjecture of Angelini et al. (2015). To our knowledge this is the first impossibility result for HSBM weak recovery.As usual, we reduce the study of non-recovery of HSBM to the study of non-reconstruction in a related broadcasting on hypertrees (BOHT) model. While we show that BOHT’s reconstruction threshold coincides with KS for $r=3,4$, surprisingly, we demonstrate that for $r\ge 7$ reconstruction is possible also below KS. This shows an interesting phase transition in the parameter $r$, and suggests that for $r\ge 7$, there might be an information-computation gap for the HSBM. For $r=5,6$ and large degree we propose an approach for showing non-reconstruction below KS, suggesting that $r=7$ is the correct threshold for onset of the new phase.

Uniqueness of BP fixed point for the Potts model and applications to community detection

Wed, 12 Jul 2023 00:00:00 +0000

In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual information (Abbe et al. (2021)). The 2-community case corresponds to an Ising model, for which Yu and Polyanskiy (2022) established uniqueness for all cases.In this paper we analyze the $q$-ary Potts model, i.e., broadcasting of $q$-ary spins on a Galton-Watson tree with expected offspring degree $d$ through Potts channels with second-largest eigenvalue $\lambda$. We allow the intermediate vertices to be observed through noisy channels (side information). We prove that BP uniqueness holds with and without side information when $d\lambda^2 \ge 1 + C \max\{\lambda, q^{-1}\}\log q$ for some absolute constant $C>0$ independent of $q,\lambda,d$. For large $q$ and $\lambda = o(1/\log q)$, this is asymptotically achieving the Kesten-Stigum threshold $d\lambda^2=1$. These results imply mutual information formulas and optimal recovery algorithms for the $q$-community SBM in the corresponding ranges.For $q\ge 4$, Sly (2011); Mossel et al. (2022) showed that there exist choices of $q,\lambda,d$ below Kesten-Stigum (i.e. $d\lambda^2 < 1$) but reconstruction is possible. Somewhat surprisingly, we show that in such regimes BP uniqueness does not hold at least in the presence of weak side information.Our technical tool is a theory of $q$-ary symmetric channels, that we initiate here, generalizing the classical and widely-utilized information-theoretic characterization of BMS (binary memoryless symmetric) channels.

Non-asymptotic convergence bounds for Sinkhorn iterates and their gradients: a coupling approach.

Wed, 12 Jul 2023 00:00:00 +0000

Computational optimal transport (OT) has recently emerged as a powerful framework with applications in various fields. In this paper we focus on a relaxation of the original OT problem, the entropic OT problem, which allows to implement efficient and practical algorithmic solutions, even in high dimensional settings. This formulation, also known as the Schrödinger Bridge problem, notably connects with Stochastic Optimal Control (SOC) and can be solved with the popular Sinkhorn algorithm. In the case of discrete-state spaces, this algorithm is known to have exponential convergence; however, achieving a similar rate of convergence in a more general setting is still an active area of research. In this work, we analyze the convergence of the Sinkhorn algorithm for probability measures defined on the d-dimensional torus T, that admit densities with respect to the Haar measure of T. In particular, we prove pointwise exponential convergence of Sinkhorn iterates and their gradient. Our proof relies on the connection between these iterates and the evolution along the Hamilton-Jacobi-Bellman equations of value functions obtained from SOC-problems. Our approach is novel in that it is purely probabilistic and relies on coupling by reflection techniques for controlled diffusions on the torus.

Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler

Wed, 12 Jul 2023 00:00:00 +0000

The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.

The Expressive Power of Tuning Only the Normalization Layers

Wed, 12 Jul 2023 00:00:00 +0000

Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks. These findings open the questions about the expressive power of tuning the normalization layers of frozen networks. In this work, we take the first step towards this question and show that for random ReLU networks, finetuning only its normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller. We show that this holds even for randomly sparsified networks, under sufficient overparameterization, in agreement with prior empirical work.

Ticketed Learning–Unlearning Schemes

Wed, 12 Jul 2023 00:00:00 +0000

We consider the learning–unlearning paradigm defined as follows. First given a dataset, the goal is to learn a good predictor, such as one minimizing a certain loss. Subsequently, given any subset of examples that wish to be unlearnt, the goal is to learn, without the knowledge of the original training dataset, a good predictor that is identical to the predictor that would have been produced when learning from scratch on the surviving examples.We propose a new ticketed model for learning–unlearning wherein the learning algorithm can send back additional information in the form of a small-sized (encrypted) “ticket” to each participating training example, in addition to retaining a small amount of “central” information for later. Subsequently, the examples that wish to be unlearnt present their tickets to the unlearning algorithm, which additionally uses the central information to return a new predictor. We provide space-efficient ticketed learning–unlearning schemes for a broad family of concept classes, including thresholds, parities, intersection-closed classes, among others.En route, we introduce the count-to-zero problem, where during unlearning, the goal is to simply know if there are any examples that survived. We give a ticketed learning–unlearning scheme for this problem that relies on the construction of Sperner families with certain properties, which might be of independent interest.

Minimax optimal testing by classification

Wed, 12 Jul 2023 00:00:00 +0000

This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing (CAT). In CAT, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and alternative are only specified via simulators (as in many scientific experiments). We study goodness-of-fit, two-sample (TS) and likelihood-free hypothesis testing (LFHT), and show that CAT achieves (near-)minimax optimal sample complexity in both the dependence on the total-variation (TV) separation ε and the probability of error δ in a variety of non-parametric settings, including discrete distributions, d-dimensional distributions with a smooth density, and the Gaussian sequence model. In particular, we close the high probability sample complexity of LFHT for each class. As another highlight, we recover the minimax optimal complexity of TS over discrete distributions, which was recently established by Diakonikolas et al. (2021). The corresponding CAT simply compares empirical frequencies in the first half of the data, and rejects the null when the classification accuracy on the second half is better than random.

Community Detection in the Hypergraph SBM: Exact Recovery Given the Similarity Matrix

Wed, 12 Jul 2023 00:00:00 +0000

Community detection is a fundamental problem in network science. In this paper, we consider community detection in hypergraphs drawn from the \emph{hypergraph stochastic block model} (HSBM), with a focus on exact community recovery. We study the performance of polynomial-time algorithms which operate on the \emph{similarity matrix} $W$, where $W_{ij}$ reports the number of hyperedges containing both $i$ and $j$. Under this information model, while the precise information-theoretic limit is unknown, Kim, Bandeira, and Goemans derived a sharp threshold up to which the natural min-bisection estimator on $W$ succeeds. As min-bisection is NP-hard in the worst case, they additionally proposed a semidefinite programming (SDP) relaxation and conjectured that it achieves the same recovery threshold as the min-bisection algorithm. In this paper, we confirm this conjecture. We also design a simple and highly efficient spectral algorithm with nearly linear runtime and show that it achieves the min-bisection threshold. Moreover, the spectral algorithm also succeeds in denser regimes and is considerably more efficient than previous approaches, establishing it as the method of choice. Our analysis of the spectral algorithm crucially relies on strong \emph{entrywise} bounds on the eigenvectors of $W$. Our bounds are inspired by the work of Abbe, Fan, Wang, and Zhong, who developed entrywise bounds for eigenvectors of symmetric matrices with independent entries. Despite the complex dependency structure in similarity matrices, we prove similar entrywise guarantees.

Projection-free Online Exp-concave Optimization

Wed, 12 Jul 2023 00:00:00 +0000

We consider the setting of online convex optimization (OCO) with \textit{exp-concave} losses. The best regret bound known for this setting is $O(n\log{}T)$, where $n$ is the dimension and $T$ is the number of prediction rounds (treating all other quantities as constants and assuming $T$ is sufficiently large), and is attainable via the well-known Online Newton Step algorithm (ONS). However, ONS requires on each iteration to compute a projection (according to some matrix-induced norm) onto the feasible convex set, which is often computationally prohibitive in high-dimensional settings and when the feasible set admits a non-trivial structure. In this work we consider projection-free online algorithms for exp-concave and smooth losses, where by projection-free we refer to algorithms that rely only on the availability of a linear optimization oracle (LOO) for the feasible set, which in many applications of interest admits much more efficient implementations than a projection oracle. We present an LOO-based ONS-style algorithm, which using overall $O(T)$ calls to a LOO, guarantees in worst case regret bounded by $\widetilde{O}(n^{2/3}T^{2/3})$ (ignoring all quantities except for $n,T$). However, our algorithm is most interesting in an important and plausible low-dimensional data scenario: if the gradients (approximately) span a subspace of dimension at most $\rho$, $\rho << n$, the regret bound improves to $\widetilde{O}(\rho^{2/3}T^{2/3})$, and by applying standard deterministic sketching techniques, both the space and average additional per-iteration runtime requirements are only $O(\rho{}n)$ (instead of $O(n^2)$). This improves upon recently proposed LOO-based algorithms for OCO which, while having the same state-of-the-art dependence on the horizon $T$, suffer from regret/oracle complexity that scales with $\sqrt{n}$ or worse.

Universality of Langevin Diffusion for Private Optimization, with Applications to Sampling from Rashomon Sets

Wed, 12 Jul 2023 00:00:00 +0000

In this paper we provide an algorithmic framework based on Langevin diffusion (LD) and its corresponding discretizations that allow us to simultaneously obtain: i) An algorithm for sampling from the exponential mechanism, whose privacy analysis does not depend on convexity, and which can be stopped at anytime without compromising privacy, and ii) tight uniform stability guarantees for the exponential mechanism. As a direct consequence, we obtain optimal excess empirical and population risk guarantees for (strongly) convex losses under both pure and approximate differential privacy (DP). The framework allows us to design a DP uniform sampler from a Rashomon set. Rashomon sets are widely used in interpretable and robust machine learning, understanding variable importance, and characterizing fairness.

Geometric Barriers for Stable and Online Algorithms for Discrepancy Minimization

Wed, 12 Jul 2023 00:00:00 +0000

For many computational problems involving randomness, intricate geometric features of the solution space have been used to rigorously rule out powerful classes of algorithms. This is often accomplished through the lens of the multi Overlap Gap Property ($m$-OGP), a rigorous barrier against algorithms exhibiting input stability. In this paper, we focus on the algorithmic tractability of two models: (i) discrepancy minimization, and (ii) the symmetric binary perceptron (\texttt{SBP}), a random constraint satisfaction problem as well as a toy model of a single-layer neural network.Our first focus is on the limits of online algorithms. By establishing and leveraging a novel geometrical barrier, we obtain sharp hardness guarantees against online algorithms for both the \texttt{SBP} and discrepancy minimization. Our results match the best known algorithmic guarantees, up to constant factors. Our second focus is on efficiently finding a constant discrepancy solution, given a random matrix $\mathcal{M}\in\R^{M\times n}$. In a smooth setting, where the entries of $\mathcal{M}$ are i.i.d.\,standard normal, we establish the presence of $m$-OGP for $n=\Theta(M\log M)$. Consequently, we rule out the class of stable algorithms at this value. These results give the first rigorous evidence towards \citet[Conjecture 1]{altschuler2021discrepancy}. Our methods use the intricate geometry of the solution space to prove tight hardness results for online algorithms. The barrier we establish is a novel variant of the $m$-OGP. Furthermore, it regards $m$-tuples of solutions with respect to correlated instances, with growing values of $m$, $m=\omega(1)$. Importantly, our results rule out online algorithms succeeding even with an exponentially small probability.

Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

Wed, 12 Jul 2023 00:00:00 +0000

Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush–Kuhn–Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data. The settings include variants of the noisy class-conditional Gaussians considered in previous work as well as new distributional settings where benign overfitting has not been previously observed. The key ingredient to our proof is the observation that when the training data is nearly-orthogonal, both linear classifiers and leaky ReLU networks satisfying the KKT conditions for their respective margin maximization problems behave like a weighted average of the training examples.

Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression

Wed, 12 Jul 2023 00:00:00 +0000

We consider contextual bandits with linear constraints (CBwLC), a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption. This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption. We provide the first algorithm for CBwLC (or CBwK) that is based on regression oracles. The algorithm is simple, computationally efficient, and admits vanishing regret. It is statistically optimal for the variant of CBwK in which the algorithm must stop once some constraint is violated. Further, we provide the first vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the stochastic environment. We side-step strong impossibility results from prior work by identifying a weaker (and, arguably, fairer) benchmark to compare against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML 2020), a regression-based technique for contextual bandits. Our analysis leverages the inherent modularity of both techniques.

Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient

Wed, 12 Jul 2023 00:00:00 +0000

A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision- Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts:• they hold in expectation, with no restrictions on the class of algorithms under consideration.• they hold globally, and do not rely on the notion of localization used by Foster et al. (2021).• most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role.We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.

On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring

Wed, 12 Jul 2023 00:00:00 +0000

A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with function approximation and normal-form games with bandit feedback. We focus on equilibrium computation, in which a centralized learning algorithm aims to compute an equilibrium by controlling multiple agents that interact with an (unknown) environment. Our main contributions are:• We provide upper and lower bounds on the optimal sample complexity for multi-agent decision making based on a multi-agent generalization of the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. (2021) in the single-agent counterpart to our setting. Compared to the best results for the single-agent setting, our upper and lower bounds have additional gaps. We show that no “reasonable” complexity measure can close these gaps, highlighting a striking separation between single and multiple agents.• We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making, but with hidden (unobserved) rewards, a framework that subsumes variants of the partial monitoring problem. As a consequence of this connection, we characterize the statistical complexity for hidden-reward interactive decision making to the best extent possible.Building on this development, we provide several new structural results, including 1) conditions under which the statistical complexity of multi-agent decision making can be reduced to that of single-agent, and 2) conditions under which the so-called curse of multiple agents can be avoided.

Optimal Prediction Using Expert Advice and Randomized Littlestone Dimension

Wed, 12 Jul 2023 00:00:00 +0000

A classical result in online learning characterizes the optimal mistake bound achievable by deterministic learners using the Littlestone dimension (Littlestone ’88).We prove an analogous result for randomized learners: we show that the optimal expected mistake bound in learning a class $\mathcal{H}$ equals its randomized Littlestone dimension, which we define as follows: it is the largest $d$ for which there exists a tree shattered by $\mathcal{H}$ whose average depth is $2d$.We further study optimal mistake bounds in the agnostic case, as a function of the number of mistakes made by the best function in $\mathcal{H}$, denoted by $k$. Towards this end we introduce the $k$-Littlestone dimension and its randomized variant, and use them to characterize the optimal deterministic and randomized mistake bounds.Quantitatively, we show that the optimal randomized mistake bound for learning a class with Littlestone dimension $d$ is $k + \Theta (\sqrt{k d} + d )$ (equivalently, the optimal regret is $\Theta(\sqrt{kd} + d$). This also implies an optimal deterministic mistake bound of $2k + O (\sqrt{k d} + d )$, thus resolving an open question which was studied by Auer and Long [’99]. As an application of our theory, we revisit the classical problem of prediction using expert advice: about 30 years ago Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth studied prediction using expert advice, provided that the best among the $n$ experts makes at most $k$ mistakes, and asked what are the optimal mistake bounds (as a function of $n$ and $k$). Cesa-Bianchi, Freund, Helmbold, and Warmuth [’93, ’96] provided a nearly optimal bound for deterministic learners, and left the randomized case as an open problem. We resolve this question by providing an optimal learning rule in the randomized case, and showing that its expected mistake bound equals half of the deterministic bound, up to negligible additive terms. This improves upon previous works by Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [’93, ’97], by Abernethy, Langford, and Warmuth [’06], and by Brânzei and Peres [’19], which handled the regimes $k \ll \log n$ or $k\gg \log n$. In contrast, our result applies to all pairs $n,k$, and does so via a unified analysis using the randomized Littlestone dimension.In our proofs we develop and use optimal learning rules, which can be seen as natural variants of the Standard Optimal Algorithm ($\mathsf{SOA}$) of Littlestone: a weighted variant in the agnostic case, and a probabilistic variant in the randomized case. We conclude the paper with suggested directions for future research and open questions.

Approximately Stationary Bandits with Knapsacks

Wed, 12 Jul 2023 00:00:00 +0000

Bandits with Knapsacks (BwK), the generalization of the Multi-Armed Bandits problem under global budget constraints, has received a lot of attention in recent years. It has numerous applications, including dynamic pricing, repeated auctions, ad allocation, network scheduling, etc. Previous work has focused on one of the two extremes: Stochastic BwK where the rewards and consumptions of the resources of each round are sampled from an i.i.d. distribution, and Adversarial BwK where these parameters are picked by an adversary. Achievable guarantees in the two cases exhibit a massive gap: No-regret learning is achievable in the stochastic case, but in the adversarial case only competitive ratio style guarantees are achievable, where the competitive ratio depends either on the budget or on both the time and the number of resources. What makes this gap so vast is that in Adversarial BwK the guarantees get worse in the typical case when the budget is more binding. While “best-of-both-worlds” type algorithms are known (single algorithms that provide the best achievable guarantee in each extreme case), their bounds degrade to the adversarial case as soon as the environment is not fully stochastic.Our work aims to bridge this gap, offering guarantees for a workload that is not exactly stochastic but is also not worst-case. We define a condition, Approximately Stationary BwK, that parameterizes how close to stochastic or adversarial an instance is. Based on these parameters, we explore what is the best competitive ratio attainable in BwL. We explore two algorithms that are oblivious to the values of the parameters but guarantee competitive ratios that smoothly transition between the best possible guarantees in the two extreme cases, depending on the values of the parameters. Our guarantees offer great improvement over the adversarial guarantee, especially when the available budget is small. We also prove bounds on the achievable guarantee, showing that our results are approximately tight when the budget is small.

Quantum Channel Certification with Incoherent Measurements

Wed, 12 Jul 2023 00:00:00 +0000

In the problem of quantum channel certification, we have black box access to a quantum process and would like to decide if this process matches some predefined specification or is $\eps$-far from this specification. The objective is to achieve this task while minimizing the number of times the black box is used. Note that the state certification problem is a special case where the black box has no input. Here, we focus on two relevant extreme cases. The first one is when the predefined specification is a unitary channel, e.g., a gate in a quantum circuit. In this case, we show that testing whether the black box is described by a fixed unitary or $\eps$-far from it in the trace norm requires $\Theta(d/\eps^2)$ uses of the black box. The second setting we consider is when the predefined specification is a completely depolarizing channels with input dimension $\din$ and output dimension $\dout$. In this case, we prove that, in the non-adaptive setting, $\Tilde{\Theta}(\din^2\dout^{1.5}/\eps^2)$ uses of the channel are necessary and sufficient to verify whether it is equal to the depolarizing channel or $\eps$-far from it in the diamond norm. Finally, we prove a lower bound of $\Omega(\din^2\dout/\eps^2)$ for this problem in the adaptive setting. Note that the special case $\din = 1$ corresponds to the well-studied quantum identity testing problem.

Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

Wed, 12 Jul 2023 00:00:00 +0000

This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR’20). Empirical evidence suggests that these functions more closely capture practical machine learning problems as compared to the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$ which is $(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $\L_0$-smooth setting, and in many practical settings is either not satisfied, or results in weaker convergence rates with respect to the noise scaling of the convergence rate.We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously “large” on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability even for smooth and strongly-convex functions when $\sigma_1 > 1$.

Efficient Algorithms for Sparse Moment Problems without Separation

Wed, 12 Jul 2023 00:00:00 +0000

We consider the sparse moment problem of learning a $k$-spike mixture in high-dimensional space from its noisy moment information in any dimension. We measure the accuracy of the learned mixtures using transportation distance. Previous algorithms either assume certain separation assumptions, use more recovery moments, or run in (super) exponential time. Our algorithm for the 1-dimensional problem (also called the sparse Hausdorff moment problem) is a robust version of the classic Prony’s method, and our contribution mainly lies in the analysis. We adopt a global and much tighter analysis than previous work (which analyzes the perturbation of the intermediate results of Prony’s method). A useful technical ingredient is a connection between the linear system defined by the Vandermonde matrix and the Schur polynomial, which allows us to provide tight perturbation bound independent of the separation and may be useful in other contexts. To tackle the high-dimensional problem, we first solve the 2-dimensional problem by extending the 1-dimensional algorithm and analysis to complex numbers. Our algorithm for the high-dimensional case determines the coordinates of each spike by aligning a 1d projection of the mixture to a random vector and a set of 2d projections of the mixture. Our results have applications to learning topic models and Gaussian mixtures, implying improved sample complexity results or running time over prior work.

Improved dimension dependence of a proximal algorithm for sampling

Wed, 12 Jul 2023 00:00:00 +0000

We propose a sampling algorithm that achieves superior complexity bounds in all the classical settings (strongly log-concave, log-concave, Logarithmic-Sobolev inequality (LSI), Poincaré inequality) as well as more general settings with semi-smooth or composite potentials. Our algorithm is based on the proximal sampler introduced in Lee et al. 2021. The performance of this proximal sampler is determined by that of the restricted Gaussian oracle (RGO), a key step in the proximal sampler. The main contribution of this work is an inexact realization of RGO based on approximate rejection sampling. To bound the inexactness of RGO, we establish a new concentration inequality for semi-smooth functions over Gaussian distributions, extending the well-known concentration inequality for Lipschitz functions. Applying our RGO implementation to the proximal sampler, we achieve state-of-the-art complexity bounds in almost all settings. For instance, for strongly log-concave distributions, our method has complexity bound $\tilde\cO(\kappa d^{1/2})$ without warm start, better than the minimax bound for MALA. For distributions satisfying the LSI, our bound is $\tilde \cO(\hat \kappa d^{1/2})$ where $\hat \kappa$ is the ratio between smoothness and the LSI constant, better than all existing bounds.

From Pseudorandomness to Multi-Group Fairness and Back

Wed, 12 Jul 2023 00:00:00 +0000

We identify and explore connections between the recent literature on multi-group fairness for prediction algorithms and the pseudorandomness notions of leakage-resilience and graph regularity. We frame our investigation using new, statistical distance-based variants of multicalibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new, more efficient algorithms for multicalibration in certain parameter regimes and a novel proof of a hardcore lemma for real-valued functions.

Toward L_∞Recovery of Nonlinear Functions: A Polynomial Sample Complexity Bound for Gaussian Random Fields

Wed, 12 Jul 2023 00:00:00 +0000

Many machine learning applications require learning a function with a small worst-case error over the entire input domain, that is, the $L_\infty$-error, whereas most existing theoretical works only guarantee recovery in average errors such as the $L_2$-error. $L_\infty$-recovery from polynomial samples is even impossible for seemingly simple function classes such as constant-norm infinite-width two-layer neural nets. This paper makes some initial steps beyond the impossibility results by leveraging the randomness in the ground-truth functions. We prove a polynomial sample complexity bound for random ground-truth functions drawn from Gaussian random fields. Our key technical novelty is to prove that the degree-$k$ spherical harmonics components of a function from Gaussian random field cannot be spiky in that their $L_\infty$/$L_2$ ratios are upperbounded by $O(d \sqrt{\ln k})$ with high probability. In contrast, the worst-case $L_\infty$/$L_2$ ratio for degree-$k$ spherical harmonics is on the order of $\Omega(\min\{d^{k/2},k^{d/2}\})$.

Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Wed, 12 Jul 2023 00:00:00 +0000

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, $y = g(w^* \cdot x) + \xi + \eps$ where $\xi$ is the oblivious noise drawn independently of $x$, satisfying $\Pr[\xi = 0] \geq o(1)$, and $\eps \sim \cN(0, \sigma^2)$. Our goal is to accurately recover a function $g(w \cdot x)$ with arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles the problem in its most general distribution-independent setting, where the solution may not be identifiable. The algorithm is designed to return the solution if it is identifiable, and otherwise return a small list of candidates, one of which is close to the true solution. Furthermore, we characterize a necessary and sufficient condition for identifiability, which holds in broad settings. The problem is identifiable when the quantile at which $\xi + \eps = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first result for GLM regression which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression with oblivious noise, and giving algorithms under more restrictive assumptions.

Statistical and Computational Limits for Tensor-on-Tensor Association Detection

Wed, 12 Jul 2023 00:00:00 +0000

In this paper, we consider the tensor-on-tensor association detection problem, where the goal is to detect whether there is an association between the tensor responses to tensor covariates linked via a low-rank tensor parameter. We first In this paper, we consider the tensor-on-tensor association detection problem, where the goal is to detect whether there is an association between the tensor responses to tensor covariates linked via a low-rank tensor parameter. We first develop tight bounds on the signal-to-noise ratio (SNR) such that the detection problem is statistically possible. We then provide testing procedures that succeed when the SNR is above the threshold. On the other hand, the statistical optimal tests often require computing the largest singular value of a given tensor, which can be NP-hard in general. To complement that, we develop efficient polynomial-time testing procedures with provable guarantees. We also develop matching lower bounds under the Statistical Query model and show that the SNRs required by the proposed polynomial-time algorithms are essential for computational efficiency. We identify a gap that appears between the SNR requirements of the optimal unconstrained-time tests and polynomial-time tests if and only if the sum of the tensor response order and the tensor covariate order is no less than three. To our best knowledge, this is the first complete characterization of the statistical and computational limits for the general tensor-on-tensor association detection problem. Our findings significantly generalize the results in the literature on signal detection in linear regression and low-rank matrix trace regression. Finally, the connection on the computational hardness of the detection problem and the corresponding estimation problem is discussed.

Self-Directed Linear Classification

Wed, 12 Jul 2023 00:00:00 +0000

In online classification, a learner is presented with a sequence of examples and aims to predict their labels in an online fashion so as to minimize the total number of mistakes. In the self-directed variant, the learner knows in advance the pool of examples and can adaptively choose the order in which predictions are made. Here we study the power of choosing the prediction order and establish the first strong separation between worst-order and random-order learning for the fundamental task of linear classification. Prior to our work, such a separation was known only for very restricted concept classes, e.g., one-dimensional thresholds or axis-aligned rectangles.We present two main results.If $X$ is a dataset of $n$ points drawn uniformly at random from the $d$-dimensional unit sphere, we design an efficient self-directed learner thatmakes $O(d \log \log(n))$ mistakes and classifies the entire dataset.If $X$ is an arbitrary $d$-dimensional dataset of size $n$, we design an efficient self-directed learner that predicts the labels of $99%$ of the points in $X$ with mistake bound independent of $n$. In contrast, under a worst- or random-ordering, the number of mistakes must be at least $\Omega(d \log n)$, even when the points are drawn uniformly from the unit sphere and the learner only needs to predict the labels for $1%$ of them.

SQ Lower Bounds for Learning Mixtures of Separated and Bounded Covariance Gaussians

Wed, 12 Jul 2023 00:00:00 +0000

We study the complexity of learning mixtures of separated Gaussians with common unknown bounded covariance matrix. Specifically, we focus on learning Gaussian mixture models (GMMs) on $\mathbb{R}^d$ of the form $P= \sum_{i=1}^k w_i \mathcal{N}(\vec \mu_i,\vec \Sigma_i)$, where $\vec \Sigma_i = \vec \Sigma \preceq \vec I$and $\min_{i \neq j} \|\vec \mu_i - \vec \mu_j\|_2 \geq k^\epsilon$ for some $\epsilon>0$. Known learning algorithms for this family of GMMs have complexity $(dk)^{O(1/\epsilon)}$. In this work, we prove that any Statistical Query (SQ) algorithm for this problem requires complexity at least $d^{\Omega(1/\epsilon)}$. Our SQ lower bound implies a similar lower bound for low-degree polynomial tests. Our result provides evidence that known algorithms for this problem are nearly best possible.

Information-Computation Tradeoffs for Learning Margin Halfspaces with Random Classification Noise

Wed, 12 Jul 2023 00:00:00 +0000

We study the problem of PAC learning $\gamma$-margin halfspaces with Random Classification Noise. We establish an information-computation tradeoffsuggesting an inherent gap between the sample complexity of the problem and the sample complexity of computationally efficient algorithms. Concretely, the sample complexity of the problem is $\widetilde{\Theta}(1/(\gamma^2 \epsilon))$. We start by giving a simple efficient algorithm with sample complexity $\widetilde{O}(1/(\gamma^2 \epsilon^2))$. Our main resultis a lower bound for Statistical Query (SQ) algorithms and low-degree polynomial tests suggesting that the quadratic dependence on $1/\epsilon$ in the sample complexity is inherent for computationally efficient algorithms.Specifically, our results imply a lower bound of $\widetilde{\Omega}(1/(\gamma^{1/2} \epsilon^2))$ on the sample complexity of any efficient SQ learner or low-degree test.

Sharp analysis of EM for learning mixtures of pairwise differences

Wed, 12 Jul 2023 00:00:00 +0000

We consider a symmetric mixture of linear regressions with random samples from the pairwise comparison design, which can be seen as a noisy version of a type of Euclidean distance geometry problem. We analyze the expectation-maximization (EM) algorithm locally around the ground truth and establish that the sequence converges linearly, providing an $\ell_\infty$-norm guarantee on the estimation error of the iterates. Furthermore, we show that the limit of the EM sequence achieves the sharp rate of estimation in the $\ell_2$-norm, matching the information-theoretically optimal constant. We also argue through simulation that convergence from a random initialization is much more delicate in this setting, and does not appear to occur in general. Our results show that the EM algorithm can exhibit several unique behaviors when the covariate distribution is suitably structured.

Law of Large Numbers for Bayesian two-layer Neural Network trained with Variational Inference

Wed, 12 Jul 2023 00:00:00 +0000

We provide a rigorous analysis of training by variational inference (VI) of Bayesian neural networks in the two-layer and infinite-width case. We consider a regression problem with a regularized evidence lower bound (ELBO) which is decomposed into the expected log-likelihood of the data and the Kullback-Leibler (KL) divergence between the a priori distribution and the variational posterior. With an appropriate weighting of the KL, we prove a law of large numbers for three different training schemes: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes by Backprop, and (iii) a new and computationally cheaper algorithm which we introduce as Minimal VI. An important result is that all methods converge to the same mean-field limit. Finally, we illustrate our results numerically and discuss the need for the derivation of a central limit theorem.

Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

Wed, 12 Jul 2023 00:00:00 +0000

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions. However, this phenomenon has only been studied for specific tasks and metrics, or by relying on computationally expensive methods. We address this by providing an algorithmic framework for gaussianizing data using sparse sketching operators, proving that it is possible to efficiently construct data sketches that are nearly indistinguishable (in terms of total variation distance) from sub-gaussian random designs. In particular, relying on a recently introduced sketching technique called Leverage Score Sparsified (LESS) embeddings, we show that one can construct an $n\times d$ sketch of an$N\times d$ matrix $A$, where $n\ll N$, that is nearly indistinguishable from a sub-gaussian design, in time $O(\mathrm{nnz}(A)\log N + nd^2)$, where $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$. As a consequence, strong statistical guarantees and precise asymptotics available for the estimators produced from sub-gaussian designs (e.g., for least squares and Lasso regression, covariance estimation, low-rank approximation, etc.) can be straightforwardly adapted to our sketching framework. We illustrate this with a new approximation guarantee for sketched least squares, among other examples. The key technique that enables our analysis is a novel variant of the Hanson-Wright inequality on the concentration of random quadratic forms, which we establish for random vectors that arise from sparse sketches.

Find a witness or shatter: the landscape of computable PAC learning.

Wed, 12 Jul 2023 00:00:00 +0000

This paper contributes to the study of CPAC learnability —a computable version of PAC learning– by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypotheses which is properly CPAC learnable, but only with uncomputably fast-growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.

On the Existence of a Complexity in Fixed Budget Bandit Identification

Wed, 12 Jul 2023 00:00:00 +0000

In fixed budget bandit identification, an algorithm sequentially observes samples from several distributions up to a given final time.It then answers a query about the set of distributions. A good algorithm will have a small probability of error.While that probability decreases exponentially with the final time, the best attainable rate is not known precisely for most identification tasks.We show that if a fixed budget task admits a complexity, defined as a lower bound on the probability of error which is attained by the same algorithm on all bandit problems, then that complexity is determined by the best non-adaptive sampling procedure for that problem.We show that there is no such complexity for several fixed budget identification tasks including Bernoulli best arm identification with two arms: there is no single algorithm that attains everywhere the best possible rate.

STay-ON-the-Ridge: Guaranteed Convergence to Local Minimax Equilibrium in Nonconvex-Nonconcave Games

Wed, 12 Jul 2023 00:00:00 +0000

Min-max optimization problems involving nonconvex-nonconcave objectives have found important applications in adversarial training and other multi-agent learning settings. Yet, no known gradient descent-based method is guaranteed to converge to (even local notions of) min-max equilibrium in the nonconvex-nonconcave setting. For all known methods, there exist relatively simple objectives for which they cycle or exhibit other undesirable behavior different from converging to a point, let alone to some game-theoretically meaningful one \citep{flokas2019poincare,hsieh2021limits}. The only known convergence guarantees hold under the strong assumption that the initialization is very close to a local min-max equilibrium \citep{wang2019solving}. Moreover, the afore-described challenges are not just theoretical curiosities. All known methods are unstable in practice, even in simple settings.We propose the first method that is guaranteed to converge to a local min-max equilibrium for smooth nonconvex-nonconcave objectives. Our method is second-order and provably escapes limit cycles as long as it is initialized at an easy-to-find initial point. Both the definition of our method and its convergence analysis are motivated by the topological nature of the problem. In particular, our method is not designed to decrease some potential function, such as the distance of its iterate from the set of local min-max equilibria or the projected gradient of the objective, but is designed to satisfy a topological property that guarantees the avoidance of cycles and implies its convergence.

The Complexity of Markov Equilibrium in Stochastic Games

Wed, 12 Jul 2023 00:00:00 +0000

We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is PPAD-hard, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to the results in normal-form games, where exact CCEs are efficiently computable. A fortiori, our results imply that, in the setting of multi-agent reinforcement learning (MARL), it is computationally hard to learn stationary Markov CCE policies in stochastic games, even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be computationally efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players. In balance, our work advocates for the use of nonstationary Markov CCE policies as a computationally and statistically tractable solution concept in MARL, advancing an important and outstanding frontier in machine learning.

Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms

Wed, 12 Jul 2023 00:00:00 +0000

We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamcs (IPD). We observe that the noise introduced by the stochastic approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian. We harness this structure to absorb the stochastic approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms. For SGLD, we prove the first stable convergence rate in KL divergence without requiring uniform warm start, assuming the target density satisfies a Log-Sobolev Inequality. Our result implies superior first-order oracle complexity compared to prior works, under significantly milder assumptions. We also prove the first guarantees for SGLD under even weaker conditions such as Hölder smoothness and Poincare Inequality, thus bridging the gap between the state-of-the-art guarantees for LMC and SGLD. Our analysis motivates a new algorithm called covariance correction, which corrects for the additional noise introduced by the stochastic approximation by rescaling the strength of the diffusion. Finally, we apply our techniques to analyze RBM, and significantly improve upon the guarantees in prior works (such as removing exponential dependence on horizon), under minimal assumptions.

Near Optimal Heteroscedastic Regression with Symbiotic Learning

Wed, 12 Jul 2023 00:00:00 +0000

We consider the classical problem of heteroscedastic linear regression where, given $n$ i.i.d. samples $(\vx_i, y_i)$ drawn from the model $y_i = \dotp{\wstar}{\vx_i} + \epsilon_i \cdot \dotp{\fstar}{\vx_i}, \vx_i \sim \cN(0, \vI), \epsilon_i \sim \cN(0, 1)$, our aim is to \emph{estimate the regressor} $\wstar$ \emph{without prior knowledge of the noise parameter} $\fstar$. In addition to classical applications of such models in statistics \citep{jobson1980least}, econometrics \citep{harvey1976estimating}, time series analysis \citep{engle1982autoregressive} etc., it is also particularly relevant in machine learning problems where data is collected from multiple sources of varying (but apriori unknown) quality, e.g., in the training of large models \citep{devlin2019bert} on web-scale data. In this work, we develop an algorithm called \emph{\ouralg} (short for \emph{Symb}iotic \emph{Learn}ing) which estimates $\wstar$ in squared norm upto an error of $\Otilde(\| \fstar \|^2 \cdot (\nicefrac{1}{n} + (\nicefrac{d}{n})^2))$, and prove that this rate is minimax optimal modulo logarithmic factors. This represents a substantial improvement upon the previous best known upper bound of $\Otilde(\| \fstar \|^2 \cdot \nicefrac{d}{n})$. Our algorithm is essentially an alternating minimization procedure which comprises of two key subroutines 1. An adaptation of the classical weighted least squares heuristic to estimate $\wstar$ (dating back to at least \citet{davidian1987variance}), for which our work presents the first non-asymptotic guarantee; 2. A novel non-convex pseudogradient descent procedure for estimating $\fstar$, which draws inspiration from the phase retrieval literature. As corollaries of our analysis, we obtain fast non-asymptotic rates for two important problems, linear regression with multiplicative noise, and phase retrieval with multiplicative noise, both of which could be of independent interest. Beyond this, the proof of our lower bound, which involves a novel adaptation of LeCam’s two point method for handling infinite mutual information quantities (thereby preventing a direct application of standard techniques such as Fano’s method), could also be of broader interest for establishing lower bounds for other heteroscedastic or heavy tailed statistical problems.

A Blackbox Approach to Best of Both Worlds in Bandits and Beyond

Wed, 12 Jul 2023 00:00:00 +0000

Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the adversarial and the stochastic regimes have received growing attention recently. Existing techniques often require careful adaptation to every new problem setup, including specialized potentials and careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if there exists an algorithm that can obtain $O(\log(T))$ regret in the stochastic regime and $\tilde{O}(\sqrt{T})$ regret in the adversarial regime. In this work, we resolve this question positively and present a generally applicable reduction from best of both worlds to a wide family of follow-the-regularized-leader (FTRL) algorithms. We showcase the capability of this reduction by transforming existing algorithms that only achieve worst-case guarantees into new best-of-both-worlds algorithms in the setting of contextual bandits, graph bandits and tabular Markov decision processes.

Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation

Wed, 12 Jul 2023 00:00:00 +0000

We propose a new model, \emph{independent linear Markov game}, for multi-agent reinforcement learning with a large state space and a large number of agents.This is a class of Markov games with \emph{independent} linear function approximation, where each agent has its own function approximation for the state-action value functions that are {\it marginalized} by other players’ policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with \emph{each agent’s own function class complexity}, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle {\it non-stationarity} incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games, with applications in learning in congestion games.In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in \citep{daskalakis2022complexity}, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters. Furthermore, we design the first provably efficient algorithm for learning Markov CE that breaks the curse of multiagents.

Open Problem: Polynomial linearly-convergent method for g-convex optimization?

Wed, 12 Jul 2023 00:00:00 +0000

Let $f \colon \mathcal{M} \to \mathbb{R}$ be a Lipschitz and geodesically convex function defined on a $d$-dimensional Riemannian manifold $\mathcal{M}$. Does there exist a first-order deterministic algorithm which (a) uses at most $O(\mathrm{poly}(d) \log(\epsilon^{-1}))$ subgradient queries to find a point with target accuracy $\epsilon$, and (b) requires only $O(\mathrm{poly}(d))$ arithmetic operations per query? In convex optimization, the classical ellipsoid method achieves this. After detailing related work, we provide an ellipsoid-like algorithm with query complexity $O(d^2 \log^2(\epsilon^{-1}))$ and per-query complexity $O(d^2)$ for the limited case where $\mathcal{M}$ has constant curvature (hemisphere or hyperbolic space). We then detail possible approaches and corresponding obstacles for designing an ellipsoid-like method for general Riemannian manifolds.

Curvature and complexity: Better lower bounds for geodesically convex optimization

Wed, 12 Jul 2023 00:00:00 +0000

We study the query complexity of geodesically convex (g-convex) optimization on a manifold. To isolate the effect of that manifold’s curvature, we primarily focus on hyperbolic spaces. In a variety of settings (smooth or not; strongly g-convex or not; high- or low-dimensional), known upper bounds worsen with curvature. It is natural to ask whether this is warranted, or an artifact.For many such settings, we propose a first set of lower bounds which indeed confirm that (negative) curvature is detrimental to complexity. To do so, we build on recent lower bounds (Hamilton and Moitra, 2021; Criscitiello and Boumal, 2022) for the particular case of smooth, strongly g-convex optimization. Using a number of techniques, we also secure lower bounds which capture dependence on condition number and optimality gap, which was not previously the case.We suspect these bounds are not optimal. We conjecture optimal ones, and support them with a matching lower bound for a class of algorithms which includes subgradient descent, and a lower bound for a related game. Lastly, to pinpoint the difficulty of proving lower bounds, we study how negative curvature influences (and sometimes obstructs) interpolation with g-convex functions.

Open problem: log(n) factor in "Local Glivenko-Cantelli"

Wed, 12 Jul 2023 00:00:00 +0000

Can the log(n) factor in the upper bound of Cohen and Kontorovich (COLT, 2023)be removed?

Local Glivenko-Cantelli

Wed, 12 Jul 2023 00:00:00 +0000

If $\mu$ is a distribution over the $d$-dimensional Boolean cube $\set{0,1}^d$, our goal is to estimate its mean $p\in[0,1]^d$ based on $n$ iid draws from $\mu$. Specifically, we consider the empirical mean estimator $\pn$ and study the expected maximal deviation $\Delta_n=\E\max_{j\in[d]}|\pn(j)-p(j)|$. In the classical Universal Glivenko-Cantelli setting, one seeks distribution-free (i.e., independent of $\mu$) bounds on $\Delta_n$. This regime is well-understood: for all $\mu$, we have $\Delta_n\lesssim\sqrt{\log(d)/n}$ up to universal constants, and the bound is tight.Our present work seeks to establish dimension-free (i.e., without an explicit dependence on $d$) estimates on $\Delta_n$, including those that hold for $d=\infty$. As such bounds must necessarily depend on $\mu$, we refer to this regime as {\em local} Glivenko-Cantelli (also known as $\mu$-GC), and are aware of very few previous bounds of this type — which are either “abstract” or quite sub-optimal. Already the special case of product measures $\mu$ is rather non-trivial. We give necessary and sufficient conditions on $\mu$ for $\Delta_n\to0$, and calculate sharp rates for this decay. Along the way, we discover a novel sub-gamma-type maximal inequality for shifted Bernoullis, of independent interest.

Bregman Deviations of Generic Exponential Families

Wed, 12 Jul 2023 00:00:00 +0000

We revisit the method of mixtures, or Laplace method, to study the concentration phenomenon in generic (possibly multidimensional) exponential families. Using the duality properties of the Bregman divergence associated with the log-partition function of the family to construct nonnegative martingales, we establish a generic bound controlling the deviation between the parameter of the family and a finite sample estimate, expressed in the local geometry induced by the Bregman pseudo-metric. Our bound is time-uniform and involves a quantity extending the classical information gain to exponential families, which we call the Bregman information gain.For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian (including with unknown variance or multivariate), Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square, yielding explicit forms of the confidence sets and the Bregman information gain. We further compare the resulting confidence bounds to state-of-the-art time-uniform alternatives and show this novel method yields competitive results. Finally, we apply our result to the design of generalized likelihood ratio tests for change detection, capturing new settings such as variance change in Gaussian families.

Learning Narrow One-Hidden-Layer ReLU Networks

Wed, 12 Jul 2023 00:00:00 +0000

We consider the well-studied problem of learning a linear combination of $k$ ReLU activations with respect to a Gaussian distribution on inputs in $d$ dimensions. We give the first polynomial-time algorithm that succeeds whenever $k$ is a constant. All prior polynomial-time learners require additional assumptions on the network, such as positive combining coefficients or the matrix of hidden weight vectors being well-conditioned.Our approach is based on analyzing random contractions of higher-order moment tensors. We use a multi-scale clustering procedure to argue that sufficiently close neurons can be collapsed together, sidestepping the conditioning issues present in prior work. This allows us to design an iterative procedure to discover individual neurons.

Fast Algorithms for a New Relaxation of Optimal Transport

Wed, 12 Jul 2023 00:00:00 +0000

We introduce a new class of objectives for optimal transport computations of datasets in high-dimensional Euclidean spaces. The new objectives are parametrized by $\rho \geq 1$, and provide a metric space $\mathcal{R}_{\rho}(\cdot, \cdot)$ for discrete probability distributions in $\mathbb{R}^d$. As $\rho$ approaches $1$, the metric approaches the Earth Mover’s distance, but for $\rho$ larger than (but close to) $1$, admits significantly faster algorithms. Namely, for distributions $\mu$ and $\nu$ supported on $n$ and $m$ vectors in $\mathbb{R}^d$ of norm at most $r$ and any $\epsilon > 0$, we give an algorithm which outputs an additive $\epsilon r$ approximation to $\mathcal{R}_{\rho}(\mu, \nu)$ in time $(n+m) \cdot \mathrm{poly}((nm)^{(\rho-1)/\rho} \cdot 2^{\rho / (\rho-1)} / \epsilon)$.

Testing of Index-Invariant Properties in the Huge Object Model

Wed, 12 Jul 2023 00:00:00 +0000

Distribution testing is a central part of property testing, with applications to various research areas, such as computational and statistical learning, information theory, and probabilistic program checking. The original distribution testing model relies on samples drawn independently from the distribution to be tested. However, when the distribution in question is over the $n$-dimensional Hamming cube $\left\{0,1\right\}^{n}$ for a large $n$, even reading a few samples is infeasible. To address this, Goldreich and Ron [ITCS 2022] have defined a model called the \emph{huge object model}, in which the samples may only be queried in a few places.For any sample/query model, the following three questions are considered fundamental: {\bf (i)} understand what classes of objects can be “learned \emph{easily}", {\bf (ii)} characterize {\em testable properties}, that is, properties that can be tested in the given sample/query model using a constant number of samples/queries, and {\bf (iii)} understand the {\em gap} between {\em adaptive} and {\em non-adaptive} query/sample complexities.In this work, we study these questions for the huge object model for distribution testing. To do so, we initiate a study of a general class of distribution properties that are invariant under a permutation of the indices of the vectors in $\left\{0,1\right\}^{n}$, while still not being necessarily fully symmetric as per the definition used in traditional distribution testing.We prove that every distribution over $\left\{0,1\right\}^{n}$ whose support has a bounded VC-dimension can be efficiently learned up to a permutation. The number of queries made by the algorithm depends only on the VC-dimension of the support of the distribution and is independent of $n$. This gives efficient testers for index-invariant distribution properties that admit a global VC-dimension bound. To complement this result, we argue that satisfying only index-invariance or only a VC-dimension bound is insufficient to guarantee a tester whose query complexity is independent of $n$. Moreover, we prove that the dependency of the sample and query complexities of our tester on the VC-dimension is essentially tight. As a second part of this work, we address the question of thenumber of queries required for non-adaptive testing. We show that it can be at most quadratic in the numberof queries required for an adaptive tester in the case of index-invariant properties. This contrasts with the tight (easily provable) exponential gap between adaptive and non-adaptive testers for general non-index-invariant properties. Finally, we provide an index-invariant property for which the quadratic gap between adaptive and non-adaptive query complexities for testing is almost tight.

Open Problem: Is There a First-Order Method that Only Converges to Local Minimax Optima?

Wed, 12 Jul 2023 00:00:00 +0000

Can we effectively train a generative adversarial network (GAN), i.e., optimize a minimax problem, similar to classification neural networks, i.e., minimize a function, by gradient methods? Currently, the answer to this question is “No”. Despite the extensive studies, training GANs still remains challenging, and diffusion-based generative models are largely replacing GANs. When training GANs, we not only struggle with finding stationary points, but also suffer from the so-called mode-collapse phenomenon, generating samples that lack diversity compared to the training data. Due to the nature of GANs, mode-collapse is likely to occur when we find an optimal point for the maximin problem, rather than the original minimax problem.This suggests that answering to an open question of whether there exists a first-order method that only converges to (local) optimum of minimax problems can resolve these issues. None of the existing methods possess such a property, neither theoretically nor practically. This is in contrast to standard gradient descent successfully finding local minima. In nonconvex-nonconcave minimax problems, Jin et al. are the first to suggest an appropriate notion of local optimality, taking account of the order of minimization and maximization. They also presented a partial answer to the question above, showing that two-timescale gradient descent ascent converges to strict local minimax optima. However, the convergence to general local minimax optimum was left mostly unexplored, even though non-strict local minimax optima are prevalent. Our recent findings illustrate that it is possible to find some non-strict local minimax optima by a two-timescale extragradient method.This positive result brings new attention to the open question. Furthermore, we wish to revive discussion on the appropriate notion of local minimax optimum. This was initially discussed by Jin et al., but not much thereafter, which we believe is crucial in answering the open question.

Repeated Bilateral Trade Against a Smoothed Adversary

Wed, 12 Jul 2023 00:00:00 +0000

We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers. We provide a complete characterization of the regret regimes for fixed-price mechanisms under different feedback models in the two cases where the learner can post either the same or different prices to buyers and sellers.We begin by showing that the minimax regret after $T$ rounds is of order $\sqrt{T}$ in the full-feedback scenario. Under partial feedback, any algorithm that has to post the same price to buyers and sellers suffers worst-case linear regret. However, when the learner can post two different prices at each round, we design an algorithm enjoying regret of order $T^{3/4}$ ignoring log factors.We prove that this rate is optimal by presenting a surprising $T^{3/4}$ lower bound, which is the main technical contribution of the paper.

The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks

Wed, 12 Jul 2023 00:00:00 +0000

We study the implicit bias of batch normalization trained by gradient descent. We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2t))$ convergence rate. This distinguishes linear models with batch normalization from those without batch normalization in terms of both the type of implicit bias and the convergence rate. We then further extend our result to a class of two-layer, single-filter convolutional neural networks, and show that batch normalization has an implicit bias towards a patch-wise uniform margin. Based on two examples, we demonstrate that patch-wise uniform margin classifiers can outperform the maximum margin classifiers in certain learning problems. Our results contribute to a better theoretical understanding of batch normalization.

Beyond Parallel Pancakes: Quasi-Polynomial Time Guarantees for Non-Spherical Gaussian Mixtures

Wed, 12 Jul 2023 00:00:00 +0000

We consider mixtures of k >= 2 Gaussian components with unknown means and unknown covariance (identical for all components) that are well-separated, i.e., distinct components have statistical overlap at most k^{-C} for a large enough constant C >= 1.Previous statistical-query [DKS17] and cryptographic [BRST21, GVV22] lower bounds give formal evidence that, even for the special case of colinear means, distinguishing such mixtures from (pure) Gaussians may be exponentially hard (in k).We show that, surprisingly, this kind of hardness can only appear if mixing weights are allowed to be exponentially small. For polynomially lower bounded mixing weights, we show how to achieve non-trivial statistical guarantees in quasi-polynomial time.Concretely, we develop an algorithm based on the sum-of-squares method with running time quasi-polynomial in the minimum mixing weight. The algorithm can reliably distinguish between a mixture of k >= 2 well-separated Gaussian components and a (pure) Gaussian distribution. As a certificate, the algorithm computes a bipartition of the input sample that separates some pairs of mixture components, i.e., both sides of the bipartition contain most of the sample points of at least one component.For the special case of colinear means, our algorithm outputs a k-clustering of the input sample that is approximately consistent with all components of the underlying mixture. We obtain similar clustering guarantees also for the case that the overlap between any two mixture components is lower bounded quasi-polynomially ink (in addition to being upper bounded polynomially in k).A significant challenge for our results is that they appear to be inherently sensitive to small fractions of adversarial outliers unlike most previous algorithmic results for Gaussian mixtures. The reason is that such outliers can simulate exponentially small mixing weights even for mixtures with polynomially lower bounded mixing weights.A key technical ingredient of our algorithms is a characterization of separating directions for well-separated Gaussian components in terms of ratios of polynomials that correspond to moments of two carefully chosen orders logarithmic in the minimum mixing weight.

Geodesically convex $M$-estimation in metric spaces

Wed, 12 Jul 2023 00:00:00 +0000

We study the asymptotic properties of geodesically convex $M$-estimation on non-linear spaces. Namely, we prove that under very minimal assumptions besides geodesic convexity of the cost function, one can obtain consistency and asymptotic normality, which are fundamental properties in statistical inference. Our results extend the Euclidean theory of convex $M$-estimation; They also generalize limit theorems on non-linear spaces which, essentially, were only known for barycenters, allowing to consider robust alternatives that are defined through non-smooth $M$-estimation procedures.

Improper Multiclass Boosting

Wed, 12 Jul 2023 00:00:00 +0000

We study the setting of multiclass boosting with a possibly large number of classes. A recent work by Brukhim, Hazan, Moran, and Schapire, 2021, proved a hardness result for a large class of natural boosting algorithms we call proper. These algorithms output predictors that correspond to a plurality-vote aggregation of weak hypotheses. In particular, they showed that proper boosting algorithms must incur a large cost that scales with the number of classes.In this work we propose an efficient improper multiclass boosting algorithm that circumvents this hardness result. A key component of our algorithm is based on the technique of list learning. In list learning, instead of predicting a single outcome for a given unseen input, the goal is to provide a short menu of predictions. The resulting boosting algorithm has sample and oracle complexity bounds that are entirely independent of the number of classes.A corollary of the above is that plurality-vote over a learnable class is also learnable. We complement this result by showing that other simple aggregations over hypotheses from a learnable class do not preserve learnability, unlike in the binary setting.

Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions

Wed, 12 Jul 2023 00:00:00 +0000

We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\epsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac d {\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \epsilon}+\frac{d\log 1/\delta}{\epsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat \mu$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\eps)$, where $\omega < 2.38$ is the matrix multiplication exponent.We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above.Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n\gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n= o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n\gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance.Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.

Detection-Recovery and Detection-Refutation Gaps via Reductions from Planted Clique

Wed, 12 Jul 2023 00:00:00 +0000

Planted Dense Subgraph (PDS) problem is a prototypical problem with a computational-statistical gap. It also exhibits an intriguing additional phenomenon: different tasks, such as detection or recovery, appear to have different computational limits. A detection-recovery gap for PDS was substantiated in the form of a precise conjecture given by Chen and Xu (2014) (based on the parameter values for which a convexified MLE succeeds), and then shown to hold for low-degree polynomial algorithms by Schramm and Wein (2022) and for MCMC algorithms for Ben Arous et al. (2020). In this paper we demonstrate that a slight variation of the Planted Clique Hypothesis with secret leakage (introduced in Brennan and Bresler (2020)), implies a detection-recovery gap for PDS. In the same vein, we also obtain a sharp lower bound for refutation, yielding a detection-refutation gap. Our methods build on the framework of Brennan and Bresler (2020) to construct average-case reductions mapping secret leakage Planted Clique to appropriate target problems.

Fine-Grained Distribution-Dependent Learning Curves

Wed, 12 Jul 2023 00:00:00 +0000

Learning curves plot the expected error of a learning algorithm as a function of the number of labeled samples it receives from a target distribution. They are widely used as a measure of an algorithm’s performance, but classic PAC learning theory cannot explain their behavior. As observed by Antos and Lugosi (1996, 1998), the classic ‘No Free Lunch’ lower bounds only trace the upper envelope above all learning curves of specific target distributions. For a concept class with VC dimension $d$ the classic bound decays like $d/n$, yet it is possible that the learning curve for \emph{every} specific distribution decays exponentially. In this case, for each $n$ there exists a different ‘hard’ distribution requiring $d/n$ samples. Antos and Lugosi asked which concept classes admit a ‘strong minimax lower bound’ – a lower bound of $d’/n$ that holds for a fixed distribution for infinitely many $n$.We solve this problem in a principled manner, by introducing a combinatorial dimension called VCL that characterizes the best $d’$ for which $d’/n$ is a strong minimax lower bound. Conceptually, the VCL dimension determines the asymptotic rate of decay of the minimax learning curve, which we call the ‘distribution-free trail’ of the class. Our characterization strengthens the lower bounds of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021), and it refines their analysis of learning curves, by showing that for classes with finite VCL the learning rate can be decomposed into a linear component that depends only on the hypothesis class and a faster (e.g., exponential) component that depends also on the target distribution. As a corollary, we recover the lower bound of Antos and Lugosi (1996, 1998) for half-spaces in $\mathbb{R}^d$.Finally, to provide another viewpoint on our work and how it compares to traditional PAC learning bounds, we also present an alternative formulation of our results in a language that is closer to the PAC setting.

Precise Asymptotic Analysis of Deep Random Feature Models

Wed, 12 Jul 2023 00:00:00 +0000

We provide exact asymptotic expressions for the performance of regression by an $L-$layer deep random feature (RF) model, where the input is mapped through multiple random embedding and non-linear activation functions. For this purpose, we establish two key steps: First, we prove a novel universality result for RF models and deterministic data, by which we demonstrate that a deep random feature model is equivalent to a deep linear Gaussian model that matches it in the first and second moments, at each layer. Second, we make use of the convex Gaussian Min-Max theorem multiple times to obtain the exact behavior of deep RF models. We further characterize the variation of the eigendistribution in different layers of the equivalent Gaussian model, demonstrating that depth has a tangible effect on model performance despite the fact that only the last layer of the model is being trained.

Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making

Wed, 12 Jul 2023 00:00:00 +0000

Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.

The Sample Complexity of Approximate Rejection Sampling With Applications to Smoothed Online Learning

Wed, 12 Jul 2023 00:00:00 +0000

Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the outputdistributed as close as possible to a target distribution $\nu$. In this workwe show that the optimal total variation distance as a function of $n$ is givenby $\tilde\Theta(\frac{D}{f’(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in theseemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithmsstill hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniformover a function class and compare importance sampling with rejectionsampling.

Quadratic Memory is Necessary for Optimal Query Complexity in Convex Optimization: Center-of-Mass is Pareto-Optimal

Wed, 12 Jul 2023 00:00:00 +0000

We give query complexity lower bounds for convex optimization and the related feasibility problem. We show that quadratic memory is necessary to achieve the optimal oracle complexity for first-order convex optimization. In particular, this shows that center-of-mass cutting-planes algorithms in dimension $d$ which use $\tilde O(d^2)$ memory and $\tilde O(d)$ queries are Pareto-optimal for both convex optimization and the feasibility problem, up to logarithmic factors. Precisely, we prove that to minimize $1$-Lipschitz convex functions over the unit ball to $1/d^4$ accuracy, any deterministic first-order algorithms using at most $d^{2-\delta}$ bits of memory must make $\tilde\Omega(d^{1+\delta/3})$ queries, for any $\delta\in[0,1]$. For the feasibility problem, in which an algorithm only has access to a separation oracle, we show a stronger trade-off: for at most $d^{2-\delta}$ memory, the number of queries required is $\tilde\Omega(d^{1+\delta})$. This resolves a COLT 2019 open problem of Woodworth and Srebro.

Complexity of High-Dimensional Identity Testing with Coordinate Conditional Sampling

Wed, 12 Jul 2023 00:00:00 +0000

We study the identity testing problem for high-dimensional distributions. Given as input an explicit distribution $\mu$, an $\varepsilon>0$, and access to sampling oracle(s) for a hidden distribution $\pi$, the goal in identity testing is to distinguish whether the two distributions $\mu$ and $\pi$ are identical or are at least $\varepsilon$-far apart. When there is only access to full samples from the hidden distribution $\pi$, it is known that exponentially many samples (in the dimension) may be needed for identity testing, and hence previous works have studied identity testing with additional access to various “conditional” sampling oracles. We consider a significantly weaker conditional sampling oracle, which we call the Coordinate Oracle, and provide a computational and statistical characterization of the identity testing problem in this new model.We prove that if an analytic property known as approximate tensorization of entropy holds for an $n$-dimensional visible distribution $\mu$, then there is an efficient identity testing algorithm for any hidden distribution $\pi$ using $\widetilde{O}(n/\varepsilon)$ queries to the Coordinate Oracle. Approximate tensorization of entropy is a pertinent condition as recent works have established it for a large class of high-dimensional distributions. We also prove a computational phase transition: for a well-studied class of $n$-dimensional distributions, specifically sparse antiferromagnetic Ising models over $\{+1,-1\}^n$, we show that in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless RP=NP. We complement our results with a matching $\Omega(n/\varepsilon)$ statistical lower bound for the sample complexity of identity testing in the $\coorora$ model.

Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness

Wed, 12 Jul 2023 00:00:00 +0000

In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms weak metrics (e.g., projected norm) rather than strong metrics (e.g., L_2 norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong L_2 error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.

Inference on Strongly Identified Functionals of Weakly Identified Functions

Wed, 12 Jul 2023 00:00:00 +0000

In a variety of applications, including nonparametric instrumental variable (NPIV) analysis, proximal causal inference under unmeasured confounding, and missing-not-at-random data with shadow variables, we are interested in inference on a continuous linear functional (e.g., average causal effects) of nuisance function (e.g., NPIV regression) defined by conditional moment restrictions. These nuisance functions are generally weakly identified, in that the conditional moment restrictions can be severely ill-posed as well as admit multiple solutions. This is sometimes resolved by imposing strong conditions that imply the function can be estimated at rates that make inference on the functional possible. In this paper, we study a novel condition for the functional to be strongly identified even when the nuisance function is not; that is, the functional is amenable to asymptotically-normal estimation at root-n-rates. The condition implies the existence of debiasing nuisance functions, and we propose penalized minimax estimators for both the primary and debiasing nuisance functions. The proposed nuisance estimators can accommodate flexible function classes, and importantly they can converge to fixed limits determined by the penalization regardless of the identifiability of the nuisances. We use the penalized nuisance estimators to form a debiased estimator for the functional of interest and prove its asymptotic normality under generic high-level conditions, which provide for asymptotically valid confidence intervals. We also illustrate our method in a novel partially linear proximal causal inference problem and a partially linear instrumental variable regression problem.

Differentially Private Algorithms for the Stochastic Saddle Point Problem with Optimal Rates for the Strong Gap

Wed, 12 Jul 2023 00:00:00 +0000

We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(\epsilon,\delta)$-differential privacy with \emph{strong (primal-dual) gap} rate of $\tilde O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic optimization. Specifically, we prove a tight upper bound on the strong gap via novel implementation and analysis of the recursive regularization technique repurposed for saddle point problems. We show that this rate can be attained with $O\big(\min\big\{\frac{n^2\epsilon^{1.5}}{\sqrt{d}}, n^{3/2}\big\}\big)$ gradient complexity, and $\tilde{O}(n)$ gradient complexity if the loss function is smooth. As a byproduct of our method, we develop a general algorithm that, given a black-box access to a subroutine satisfying a certain $\alpha$ primal-dual accuracy guarantee with respect to the empirical objective, gives a solution to the stochastic saddle point problem with a strong gap of $\tilde{O}(\alpha+\frac{1}{\sqrt{n}})$. We show that this $\alpha$-accuracy condition is satisfied by standard algorithms for the empirical saddle point problem such as the proximal point method and the stochastic gradient descent ascent algorithm. Finally, to emphasize the importance of the strong gap as a convergence criterion compared to the weaker notion of primal-dual gap, commonly known as the \emph{weak gap}, we show that even for simple problems it is possible for an algorithm to have zero weak gap and suffer from $\Omega(1)$ strong gap. We also show that there exists a fundamental tradeoff between stability and accuracy. Specifically, we show that any $\Delta$-stable algorithm has empirical gap $\Omega\big(\frac{1}{\Delta n}\big)$, and that this bound is tight. This result also holds also more specifically for empirical risk minimization problems and may be of independent interest.

On Testing and Learning Quantum Junta Channels

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problems of testing and learning quantum k-junta channels, which are n-qubit to n-qubit quantum channels acting non-trivially on at most k out of n qubits and leaving the rest of qubits unchanged. We show the following.1. An \tilde{O}(k)-query algorithm to distinguish whether the given channel is k-junta channel or is far from any k-junta channels, and a lower bound \Omega(\sqrt{k}) on the number of queries;2. An \tilde{O}(4^k)-query algorithm to learn a k-junta channel, and a lower bound \Omega(4^k/k) on the number of queries.This gives the first junta channel testing and learning results, and partially answers an open problem raised by Chen et al. (2023). In order to settle these problems, we develop a Fourier analysis framework over the space of superoperators and prove several fundamental properties, which extends the Fourier analysis over the space of operators introduced in Montanaro and Osborne (2010).

Proper Losses, Moduli of Convexity, and Surrogate Regret Bounds

Wed, 12 Jul 2023 00:00:00 +0000

Proper losses (or proper scoring rules) have been used for over half a century to elicit users’ subjective probability from the observations. In the recent machine learning community, we often tackle downstream tasks such as classification and bipartite ranking with the elicited probabilities. Here, we engage in assessing the quality of the elicited probabilities with different proper losses, which can be characterized by surrogate regret bounds to describe the convergence speed of an estimated probability to the optimal one when optimizing a proper loss. This work contributes to a sharp analysis of surrogate regret bounds in two ways. First, we provide general surrogate regret bounds for proper losses measured by the $L^1$ distance. This abstraction eschews a tailor-made analysis of each downstream task and delineates how universally a loss function operates. Our analysis relies on a classical mathematical tool known as the moduli of convexity, which is of independent interest per se. Second, we evaluate the surrogate regret bounds with polynomials to identify the quantitative convergence rate. These devices enable us to compare different losses, with which we can confirm that the lower bound of the surrogate regret bounds is $\Omega(\epsilon^{1/2})$ for popular loss functions.

Multitask Learning via Shared Features: Algorithms and Hardness

Wed, 12 Jul 2023 00:00:00 +0000

We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k\ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $\gamma$, which is based on a simultaneous boosting technique and requires only $\mathrm{poly}(k/\gamma)$ samples-per-task and $\mathrm{poly}(k\log(d)/\gamma)$ samples in total. In addition, we prove a computational separation, showing that assuming there exists a concept class that cannot be learned in the attribute-efficient model, we can construct another concept class such that can be learned in the attribute-efficient model, but cannot be multitask learned efficiently—multitask learning this concept class either requires super-polynomial time complexity or a much larger total number of samples.

Open Problem: The Sample Complexity of Multi-Distribution Learning for VC Classes

Wed, 12 Jul 2023 00:00:00 +0000

Multi-distribution learning is a natural generalization of PAC learning to settings with multiple data distributions. There remains a significant gap between the known upper and lower bounds for PAC-learnable classes. In particular, though we understand the sample complexity of learning a VC dimension $d$ class on $k$ distributions to be $O(\epsilon^{-2} \ln(k) (d + k) + \min \{\epsilon^{-1} d k, \epsilon^{-4} \ln(k) d\})$, the best lower bound is $\Omega(\epsilon^{-2}(d + k \ln(k)))$. We discuss recent progress on this problem and some hurdles that are fundamental to the use of game dynamics in statistical learning.

Online Learning and Solving Infinite Games with an ERM Oracle

Wed, 12 Jul 2023 00:00:00 +0000

While ERM suffices to attain near-optimal generalization error in the stochastic learning setting, this is not known to be the case in the online learning setting, where algorithms for general concept classes rely on computationally inefficient oracles such as the Standard Optimal Algorithm (SOA). In this work, we propose an algorithm for online binary classification setting that relies solely on ERM oracle calls, and show that it has finite regret in the realizable setting and sublinearly growing regret in the agnostic setting. We bound the regret in terms of the Littlestone and threshold dimensions of the underlying concept class.We obtain similar results for nonparametric games, where the ERM oracle can be interpreted as a best response oracle, finding the best response of a player to a given history of play of the other players. In this setting, we provide learning algorithms that only rely on best response oracles and converge to approximate-minimax equilibria in two-player zero-sum games and approximate coarse correlated equilibria in multi-player general-sum games, as long as the game has bounded fat-threshold dimension. Our algorithms apply to both binary-valued and real-valued games and can be viewed as providing justification for the wide use of double oracle and multiple oracle algorithms in the practice of solving large games.

Private Online Prediction from Experts: Separations and Faster Rates

Wed, 12 Jul 2023 00:00:00 +0000

Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\wt O(\sqrt{T \log d} + \log d/\eps)$ for the stochastic setting and $\wt O(\sqrt{T \log d} + T^{1/3} \log d/\eps)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.

Statistical-Computational Tradeoffs in Mixed Sparse Linear Regression

Wed, 12 Jul 2023 00:00:00 +0000

We consider the problem of mixed sparse linear regression with two components, where two k-sparse signals β_1, β_2 ∈ R^p are to be recovered from n unlabelled noisy linear measurements. The sparsity is allowed to be sublinear in the dimension (k = o(p)), and the additive noise is assumed to be independent Gaussian with variance σ^2. Prior work has shown that the problem suffers from a k/SNR^2 -to- k^2/SNR^2 statistical-to-computational gap, resembling other computationally challenging high- dimensional inference problems such as Sparse PCA and Robust Sparse Mean Estimation (Brennan and Bresler, 2020b); here SNR := ∥β1∥^2/σ^2 = ∥β2∥^2/σ^2 is the signal-to-noise ratio. We establish the existence of a more extensive k/SNR^2 -to- k^2 (SNR+1)^2/SNR^2 computational barrier for this problem through the method of low-degree polynomials, but show that the problem is computationally hard only in a very narrow symmetric parameter regime. We identify a smooth information-computation tradeoff between the sample complexity n and runtime exp( Θ(k^2(SNR + 1)^2/(nSNR^2))) for any randomized algorithm in this hard regime. Via a simple reduction, this provides novel rigorous evidence for the existence of a computational barrier to solving exact support recovery in sparse phase retrieval with sample complexity n = o(k^2). Our second contribution is to analyze a simple thresholding algorithm which, outside of the narrow regime where the problem is hard, solves the associated mixed regression detection problem in O(np) time and matches the sample complexity required for (non-mixed) sparse linear regression of k(SNR+1)/SNR log p; this allows the recovery problem to be subsequently solved by state-of-the-art techniques from the dense case. As a special case of our results, we show that this simple algorithm is order-optimal among a large family of algorithms in solving exact signed support recovery in sparse linear regression. To the best of our knowledge, this is the first thorough study of the interplay between mixture symmetry, signal sparsity, and their joint impact on the computational hardness of mixed sparse linear regression.

From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks

Wed, 12 Jul 2023 00:00:00 +0000

This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised “mean-field” regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.

Intrinsic dimensionality and generalization properties of the R-norm inductive bias

Wed, 12 Jul 2023 00:00:00 +0000

We study the structural and statistical properties of R-norm minimizing interpolants of datasets labeled by specific target functions. The R-norm is the basis of an inductive bias for two-layer neural networks, recently introduced to capture the functional effect of controlling the size of network weights, independently of the network width. We find that these interpolants are intrinsically multivariate functions, even when there are ridge functions that fit the data, and also that the R-norm inductive bias is not sufficient for achieving statistically optimal generalization for certain learning problems. Altogether, these results shed new light on an inductive bias that is connected to practical neural network training.

Resolving the Mixing Time of the Langevin Algorithm to its Stationary Distribution for Log-Concave Sampling

Wed, 12 Jul 2023 00:00:00 +0000

Sampling from a high-dimensional distribution is a fundamental task in statistics, engineering, and the sciences. A canonical approach is the Langevin Algorithm, i.e., the Markov chain for the discretized Langevin Diffusion. This is the sampling analog of Gradient Descent. Despite being studied for several decades in multiple communities, tight mixing bounds for this algorithm remain unresolved even in the seemingly simple setting of log-concave distributions over a bounded domain. This paper completely characterizes the mixing time of the Langevin Algorithm to its stationary distribution in this setting (and others). This mixing result can be combined with any bound on the discretization bias in order to sample from the stationary distribution of the Langevin Diffusion. In this way, we disentangle the study of the mixing and bias of the Langevin Algorithm.Our key insight is to analyze the Langevin Algorithm’s convergence by using a new Lyapunov function: the shifted divergence, a quantity studied in the differential privacy literature. Briefly, this Lyapunov function is a version of the Renyi divergence that is smoothed in optimal transport distance, and we use the amount of smoothing to measure the Langevin Algorithm’s progress. In addition to giving a short, simple proof of optimal mixing bounds, this analysis approach also has several additional appealing properties. First, our approach removes all unnecessary assumptions required by other sampling analyses. Second, our approach unifies many settings: it extends if the Langevin Algorithm uses projections, stochastic mini-batch gradients, or strongly convex potentials (whereby our mixing time improves exponentially). Third, our approach unifies many metrics: it proves mixing in the stringent notion of Renyi divergence, which immediately implies mixing in all common metrics via standard comparison inequalities. Fourth, our approach exploits convexity only through the contractivity of a gradient step—reminiscent of how convexity is used in textbook proofs of Gradient Descent.

Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

Wed, 12 Jul 2023 00:00:00 +0000

Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD) or its variants on the training objective.On the conceptual side, we present a theoretical characterizations of how certain types of deep (i.e. super-constantly many layers) neural networks can still be sample and time efficiently trained on some hierarchical learning tasks, when no known existing algorithm (including layerwise training, kernel method, etc) is efficient. We establish a new principle called “backward feature correction”, where \emph{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}. We believe this is a key behind how deep learning is performing deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.

Active Coverage for PAC Reinforcement Learning

Wed, 12 Jul 2023 00:00:00 +0000

Collecting and leveraging data with good coverage properties plays a crucial role in different aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, the notion of “good coverage” really depends on the application at hand, as data suitable for one context may not be so for another. In this paper, we formalize the problem of \emph{active coverage} in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is sufficiently flexible to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contribution is an \emph{instance-dependent} lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, CovGame, that nearly matches it. We then show that CovGame can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration with an instance-dependent sample complexity that, in certain MDPs which are “easy to explore”, is lower than the minimax one. By further coupling this exploration algorithm with a new technique to do implicit eliminations in policy space, we obtain a computationally-efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with gaps between policy values.

Differentially Private and Lazy Online Convex Optimization

Wed, 12 Jul 2023 00:00:00 +0000

We study the task of differentially private online convex optimization (OCO). In the online setting, the release of each distinct decision or iterate carries with it the potential for privacy loss. To limit such privacy leakage, we design an optimization-based OCO algorithm that explicitly limits the number of switches via objective perturbation and rejection sampling. This improves over known results in multiple aspects: an optimal leading-order regret term, in being efficiently implementable without requiring log-concave sampling subroutines, and in matching the non-private regret bound for sub-constant regimes of privacy parameters. Leveraging the fact that the algorithm is designed to explicitly minimize the number of switches of decisions, we show that the algorithm also obtains optimal regret bounds in the lazy OCO setting, where the learner is constrained to perform a limited number of switches. In addition, for one- and two-dimensional decision sets, we present a novel approach for differentially private online Lipschitz learning, where the loss functions are Lipschitz but not necessarily convex, that achieves the optimal regret bound matching known lower bounds.

Causal Matrix Completion

Wed, 12 Jul 2023 00:00:00 +0000

Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed that the entries of the matrix are “missing completely at random” (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of “latent confounders”, i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, in the context of movie recommender systems—a canonical application for matrix completion—a user who vehemently dislikes horror films is unlikely to ever watch horror films. In general, these confounders yield “missing not at random” (MNAR) data, which can severely impact any inference procedure that does not correct for this bias. We develop a formal causal model for matrix completion through the language of potential outcomes, and provide novel identification arguments for a variety of causal estimands of interest. We design a procedure, which we call “synthetic nearest neighbors” (SNN), to estimate these causal estimands. We prove finite-sample consistency and asymptotic normality of our estimator. Our analysis also leads to new theoretical results for the matrix completion literature. In particular, we establish entry-wise, i.e., max-norm, finite-sample consistency and asymptotic normality results for matrix completion with MNAR data. As a special case, this also provides entry-wise bounds for matrix completion with MCAR data. Across simulated and real data, we demonstrate the efficacy of our proposed estimator.

Provable Benefits of Representational Transfer in Reinforcement Learning

Wed, 12 Jul 2023 00:00:00 +0000

We study the problem of representational transfer in RL, where an agent first pretrains in a number of \emph{source tasks} to discover a shared representation, which is subsequently used to learn a good policy in a \emph{target task}. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given a generative access to source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy in the target task. The sample complexity is close to knowing the ground truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access, and validate our findings with empirical evaluation on rich observation MDPs that require deep exploration. In our experiments, we observe speed up in learning in the target by pre-training, and also validate the need for generative access in source tasks.

VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Wed, 12 Jul 2023 00:00:00 +0000

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming closure under Bellman backups, and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\widetilde{O}(d\sqrt{TH}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.

The One-Inclusion Graph Algorithm is not Always Optimal

Wed, 12 Jul 2023 00:00:00 +0000

The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov’s inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.

SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

Wed, 12 Jul 2023 00:00:00 +0000

We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure,{\it the leap}, which measures how “hierarchical” target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $$\Tilde \Theta (d^{\max(\mathrm{Leap}(f),2)}) \,\,.$$ We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that the training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from Abbe et al.’22 by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here.Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower-bounds.