Proceedings of Machine Learning Research

Efficient active learning of sparse halfspaces

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of efficient PAC active learning of homogeneous linear classifiers (halfspaces) in $\mathbb{R}^d$, where the goal is to learn a halfspace with low error using as few label queries as possible. Under the extra assumption that there is a $t$-sparse halfspace that performs well on the data ($t \ll d$), we would like our active learning algorithm to be {\em attribute efficient}, i.e. to have label requirements sublinear in $d$. In this paper, we provide a computationally efficient algorithm that achieves this goal. Under certain distributional assumptions on the data, our algorithm achieves a label complexity of $O(t \cdot \mathrm{polylog}(d, \frac 1 \epsilon))$. In contrast, existing algorithms in this setting are either computationally inefficient, or subject to label requirements polynomial in $d$ or $\frac 1 \epsilon$.

An Estimate Sequence for Geodesically Convex Optimization

Tue, 03 Jul 2018 00:00:00 +0000

We propose a Riemannian version of Nesterov’s Accelerated Gradient algorithm (\textsc{Ragd}), and show that for \emph{geodesically} smooth and strongly convex problems, within a neighborhood of the minimizer whose radius depends on the condition number as well as the sectional curvature of the manifold, \textsc{Ragd} converges to the minimizer with acceleration. Unlike the algorithm in (Liu et al., 2017) that requires the exact solution to a nonlinear equation which in turn may be intractable, our algorithm is constructive and computationally tractable. Our proof exploits a new estimate sequence and a novel bound on the nonlinear metric distortion, both ideas may be of independent interest.

Optimal approximation of continuous functions by very deep ReLU networks

Tue, 03 Jul 2018 00:00:00 +0000

We consider approximations of general continuous functions on finite-dimensional cubes by general deep ReLU neural networks and study the approximation rates with respect to the modulus of continuity of the function and the total number of weights $W$ in the network. We establish the complete phase diagram of feasible approximation rates and show that it includes two distinct phases. One phase corresponds to slower approximations that can be achieved with constant-depth networks and continuous weight assignments. The other phase provides faster approximations at the cost of depths necessarily growing as a power law $L\sim W^{\alpha}, 0<\alpha\le 1,$ and with necessarily discontinuous weight assignments. In particular, we prove that constant-width fully-connected networks of depth $L\sim W$ provide the fastest possible approximation rate $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-2/\nu})))$ that cannot be achieved with less deep networks.

Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem

Tue, 03 Jul 2018 00:00:00 +0000

We study sampling as optimization in the space of measures. We focus on gradient flow-based optimization with the Langevin dynamics as a case study. We investigate the source of the bias of the unadjusted Langevin algorithm (ULA) in discrete time, and consider how to remove or reduce the bias. We point out the difficulty is that the heat flow is exactly solvable, but neither its forward nor backward method is implementable in general, except for Gaussian data. We propose the symmetrized Langevin algorithm (SLA), which should have a smaller bias than ULA, at the price of implementing a proximal gradient step in space. We show SLA is in fact consistent for Gaussian target measure, whereas ULA is not. We also illustrate various algorithms explicitly for Gaussian target measure with Gaussian data, including gradient descent, proximal gradient, and Forward-Backward, and show they are all consistent.

More Adaptive Algorithms for Adversarial Bandits

Tue, 03 Jul 2018 00:00:00 +0000

We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss {\it and} logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.

An explicit analysis of the entropic penalty in linear programming

Tue, 03 Jul 2018 00:00:00 +0000

Solving linear programs by using entropic penalization has recently attracted new interest in the optimization community, since this strategy forms the basis for the fastest-known algorithms for the optimal transport problem, with many applications in modern large-scale machine learning. Crucial to these applications has been an analysis of how quickly solutions to the penalized program approach true optima to the original linear program. More than 20 years ago, Cominetti and San Martín showed that this convergence is exponentially fast; however, their proof is asymptotic and does not give any indication of how accurately the entropic program approximates the original program for any particular choice of the penalization parameter. We close this long-standing gap in the literature regarding entropic penalization by giving a new proof of the exponential convergence, valid for any linear program. Our proof is non-asymptotic, yields explicit constants, and has the virtue of being extremely simple. We provide matching lower bounds and show that the entropic approach does not lead to a near-linear time approximation scheme for the linear assignment problem.

Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

Tue, 03 Jul 2018 00:00:00 +0000

We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003). For a particular local optimum of the empirical risk, with an \textit{arbitrary initialization}, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short \textit{recurrence time}; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long \textit{escape time}. We call this phenomenon \textit{empirical metastability}. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.

Private Sequential Learning

Tue, 03 Jul 2018 00:00:00 +0000

We formulate a private learning model to study an intrinsic tradeoff between privacy and query complexity in sequential learning. Our model involves a learner who aims to determine a scalar value, $v^*$, by sequentially querying an external database and receiving binary responses. In the meantime, an adversary observes the learner’s queries, though not the responses, and tries to infer from them the value of $v^*$. The objective of the learner is to obtain an accurate estimate of $v^*$ using only a small number of queries, while simultaneously protecting her privacy by making $v^*$ provably difficult to learn for the adversary. Our main results provide tight upper and lower bounds on the learner’s query complexity as a function of desired levels of privacy and estimation accuracy. We also construct explicit query strategies whose complexity is optimal up to an additive constant.

Averaging Stochastic Gradient Descent on Riemannian Manifolds

Tue, 03 Jul 2018 00:00:00 +0000

We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.

Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

Tue, 03 Jul 2018 00:00:00 +0000

The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\mathbb{R}^n$ so that data points projected onto $E$ follow a non-Gaussian distribution. Vempala and Xiao (2011) proposed a local search algorithm, and showed that it was able to estimate $E$ accurately with polynomial time and sample complexity, if the dimension of $E$ is treated as a constant and with the assumption that all one-dimensional marginals of the non-Gaussian distribution over $E$ have non-Gaussian moments. In this paper, we propose a simple spectral algorithm called \textsc{Reweighted PCA}, and prove that it possesses the same guarantee. The principle that underlies this approach is a new characterization of multivariate Gaussian distributions.

Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification

Tue, 03 Jul 2018 00:00:00 +0000

We prove that the ordinary least-squares (OLS) estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory. Our upper bound relies on a generalization of Mendelson’s small-ball method to dependent data, eschewing the use of standard mixing-time arguments. Our lower bounds reveal that these upper bounds match up to logarithmic factors. In particular, we capture the correct signal-to-noise behavior of the problem, showing that \emph{more unstable} linear systems are \emph{easier} to estimate. This behavior is qualitatively different from arguments which rely on mixing-time calculations that suggest that unstable systems are more difficult to estimate. We generalize our technique to provide bounds for a more general class of linear response time-series.

Learning Patterns for Detection with Multiscale Scan Statistics

Tue, 03 Jul 2018 00:00:00 +0000

This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern and the pattern itself is unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis—that the data is only noise—is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then by averaging the statistics over the database of tensors we can learn the pattern and obtain Bernstein-type error bounds. We will also provide a construction of an $\epsilon$-net of the location and scale parameters, providing a computationally tractable approximation with similar error bounds.

An Optimal Learning Algorithm for Online Unconstrained Submodular Maximization

Tue, 03 Jul 2018 00:00:00 +0000

We consider a basic problem at the interface of two fundamental fields: {\em submodular optimization} and {\em online learning}. In the {\em online unconstrained submodular maximization (online USM) problem}, there is a universe $[n]=\{1,2,\ldots,n\}$ and a sequence of $T$ nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of $[n]$ at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial-time no-$\frac12$-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm’s expected total value is at least $\frac12$ times that of the best subset in hindsight, up to an error term sublinear in $T$. The factor of $\tfrac 12$ cannot be improved upon by any polynomial-time online algorithm when the submodular functions are presented as value oracles. Previous work on the offline problem implies that picking a subset uniformly at random in each time step achieves zero $\frac14$-regret. A byproduct of our techniques is an explicit subroutine for the two-experts problem that has an unusually strong regret guarantee: the total value of its choices is comparable to twice the total value of either expert on rounds it did not pick that expert. This subroutine may be of independent interest.

The Externalities of Exploration and How Data Diversity Helps Exploitation

Tue, 03 Jul 2018 00:00:00 +0000

Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration—the undesirable side effects that the presence of one party may impose on another—under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence of one population of users (the majority) impacts the rewards of another (the minority). We show that this impact can, in some cases, be negative, and that, in a certain sense, no algorithm can avoid it. We then move on to study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $\tilde{O}(T^{1/3})$. Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish if one runs the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

Exponential Convergence of Testing Error for Stochastic Gradient Methods

Tue, 03 Jul 2018 00:00:00 +0000

We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing \emph{loss} (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing \emph{error} (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.

Iterate Averaging as Regularization for Stochastic Gradient Descent

Tue, 03 Jul 2018 00:00:00 +0000

We propose and analyze a variant of the classic Polyak–Ruppert averaging scheme, broadly used in stochastic gradient methods. Rather than a uniform average of the iterates, we consider a weighted average, with weights decaying in a geometric fashion. In the context of linear least-squares regression, we show that this averaging scheme has the same regularizing effect, and indeed is asymptotically equivalent, to ridge regression. In particular, we derive finite-sample bounds for the proposed approach that match the best known results for regularized stochastic gradient methods.

A Direct Sum Result for the Information Complexity of Learning

Tue, 03 Jul 2018 00:00:00 +0000

How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al., where the value of interest is the mutual information $\mathrm{I}(S;A(S))$ between the input sample $S$ and the hypothesis outputted by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $\mathcal{X}$ with information complexity at least $\Omega \left(d\log \log \frac{|\mathcal{X}|}{d}\right)$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$. The above result is in fact a special case of a more general phenomenon we explore. We define the notion of {\em information complexity} of a given class of functions $\cH$. Intuitively, it is the minimum amount of information that an algorithm for $\mathcal{X}$ must retain about its input to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes.

Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints

Tue, 03 Jul 2018 00:00:00 +0000

We study the generalization errors of \emph{non-convex} regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_N}\right)$, where $L$ is Lipschitz parameter, $\beta$ is inverse temperature, and $T_N$ is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of the parameter. The bounds characterize how the noises in the algorithm itself controls the statistical learning behavior in non-convex problems, without uniform convergence in the hypothesis space, which sheds light on the effect of training algorithms on the generalization error for deep neural networks.

Fundamental Limits of Weak Recovery with Applications to Phase Retrieval

Tue, 03 Jul 2018 00:00:00 +0000

In phase retrieval we want to recover an unknown signal $\boldsymbol x\in\mathbb C^d$ from $n$ quadratic measurements of the form $y_i = |⟨\boldsymbol a_i,\boldsymbol x⟩|^2+w_i$ where $\boldsymbol a_i\in \mathbb C^d$ are known sensing vectors and $w_i$ is measurement noise. We ask the following \emph{weak recovery} question: what is the minimum number of measurements $n$ needed to produce an estimator $\hat{\boldsymbol x}(\boldsymbol y)$ that is positively correlated with the signal $\boldsymbol x$? We consider the case of Gaussian vectors $\boldsymbol a_i$. We prove that – in the high-dimensional limit – a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $n\le d-o(d)$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $n\ge d+o(d)$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well. Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements $y_i$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.

A Data Prism: Semi-verified learning in the small-alpha regime

Tue, 03 Jul 2018 00:00:00 +0000

We consider a simple model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each “evaluator” contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a \emph{constant} number of variables. We show that, provided an $\alpha$-fraction of the evaluators are “good” (either correct, or with independent noise rate $p < 1/2$), then the true values of a $(1-\eps)$ fraction of the $n$ underlying variables can be deduced as long as $r > \log_{2-2p}(1/\alpha)$. For example, if the fraction of “good” evaluators is larger than $1/16$ and there is no noise in their responses, then accurate recovery is possible provided each worker evaluates a random set of $4$ items. This result is optimal in that if $r \leq \log_{2-2p}(1/\alpha)$ the large dataset can contain no information. This setting can be viewed as an instance of the \emph{semi-verified} learning model introduced by Charikar, Steinhardt, and Valiant, which explores the tradeoff between the number of items evaluated by each worker and the fraction of “good” evaluators. In the standard adversarial setting, our algorithm requires $\tilde{O}\left(n^{\log_{2-2p}(1/\alpha)}\right)$ evaluators. However, the algorithm runs in near linear time, $\tilde{O}_{r,\eps}(n)$, and hence would require only a near-linear number of evaluations in the weaker model in which the adversary’s responses to each $r$-tuple of items are independent of the set of evaluations collected. These settings and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. For example, the setting where you collect a dataset on customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged together with the preferences of the demographic of interest for a \emph{constant} (polynomial in $1/\alpha$ but independent of $n$), number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a “data prism” allowing one to extract the behavior of specific cohorts from a large, mixed, dataset.

Empirical bounds for functions with weak interactions

Tue, 03 Jul 2018 00:00:00 +0000

We provide sharp empirical estimates of expectation, variance and normal approximation for a class of statistics whose variation in any argument does not change too much when another argument is modified. Examples of such weak interactions are furnished by U- and V-statistics, Lipschitz L-statistics and various error functionals of $\ell_2$-regularized algorithms and Gibbs algorithms.

Breaking the $1/\sqrt{n}$ Barrier: Faster Rates for Permutation-based Models in Polynomial Time

Tue, 03 Jul 2018 00:00:00 +0000

Many applications, including rank aggregation and crowd-labeling, can be modeled in terms of a bivariate isotonic matrix with unknown permutations acting on its rows and columns. We consider the problem of estimating such a matrix based on noisy observations of a subset of its entries, and design and analyze a polynomial-time algorithm that improves upon the state of the art. In particular, our results imply that any such $n \times n$ matrix can be estimated efficiently in the normalized Frobenius norm at rate $\widetilde{\mathcal O}(n^{-3/4})$, thus narrowing the gap between $\widetilde{\mathcal O}(n^{-1})$ and $\widetilde{\mathcal O}(n^{-1/2})$, which were hitherto the rates of the most statistically and computationally efficient methods, respectively.

Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of minimizing a convex objective function $F$ when one can only evaluate its noisy approximation $\hat{F}$. Unless one assumes some structure on the noise, $\hat{F}$ may be an arbitrary nonconvex function, making the task of minimizing $F$ intractable. To overcome this, prior work has often focused on the case when $F(x)-\hat{F}(x)$ is uniformly-bounded. In this paper we study the more general case when the noise has magnitude $\alpha F(x) + \beta$ for some $\alpha, \beta > 0$, and present a polynomial time algorithm that finds an approximate minimizer of $F$ for this noise model. Previously, Markov chains, such as the stochastic gradient Langevin dynamics, have been used to arrive at approximate solutions to these optimization problems. However, for the noise model considered in this paper, no single temperature allows such a Markov chain to both mix quickly and concentrate near the global minimizer. We bypass this by combining “simulated annealing" with the stochastic gradient Langevin dynamics, and gradually decreasing the temperature of the chain in order to approach the global minimizer. As a corollary one can approximately minimize a nonconvex function that is close to a convex function; however, the closeness can deteriorate as one moves away from the optimum.

Small-loss bounds for online learning with partial information

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems where the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so called “small-loss” $o(\alpha L^{\star})$ regret bounds with high probability, where $\alpha$ is the independence number of the graph, and $L^{\star}$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e. utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\tilde{O}(\sqrt{dL^{\star}})$ for actual regret, where $d$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $\tilde{O}(\sqrt{\kappa L^{\star}})$ regret guarantee for fixed feedback graphs with clique-partition number at most $\kappa$.

Efficient Contextual Bandits in Non-stationary Worlds

Tue, 03 Jul 2018 00:00:00 +0000

Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.

Adaptivity to Smoothness in X-armed bandits

Tue, 03 Jul 2018 00:00:00 +0000

We study the stochastic continuum-armed bandit problem from the angle of adaptivity to \emph{unknown regularity} of the reward function $f$. We prove that there exists no strategy for the cumulative regret that adapts optimally to the \emph{smoothness} of $f$. We show however that such minimax optimal adaptive strategies exist if the learner is given \emph{extra-information} about $f$. Finally, we complement our positive results with matching lower bounds.

Learning Mixtures of Linear Regressions with Nearly Optimal Complexity

Tue, 03 Jul 2018 00:00:00 +0000

Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of the several unknown linear regression components, where the identity of the generated component is also unknown. Previous works either assume strong assumptions on the data distribution or have high complexity. This paper proposes a fixed parameter tractable algorithm for the problem under general conditions, which achieves global convergence and the sample complexity scales nearly linearly in the dimension. In particular, different from previous works that require the data to be from the standard Gaussian, the algorithm allows the data from Gaussians with different covariances. When the conditional number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting.

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

Tue, 03 Jul 2018 00:00:00 +0000

We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear measurements of a rank $r$ positive semidefinite matrix $X^{\star}$, we can recover $X^{\star}$ by parameterizing it by $UU^\top$ with $U\in \mathbb R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove that starting from a small initialization, gradient descent recovers $X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results solve the conjecture of Gunasekar et al.’17 under the restricted isometry property. The technique can be applied to analyzing neural networks with one-hidden-layer quadratic activations with some technical modifications.

Efficient Convex Optimization with Membership Oracles

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of minimizing a convex function over a convex set given access only to an evaluation oracle for the function and a membership oracle for the set. We give a simple algorithm which solves this problem with $\tilde{O}(n^{2})$ oracle calls and $\tilde{O}(n^{3})$ additional arithmetic operations. Using this result, we obtain more efficient reductions among the five basic oracles for convex sets and functions defined by Gr{ö}tschel, Lov{á}sz, and Schrijver (1988).

Marginal Singularity, and the Benefits of Labels in Covariate-Shift

Tue, 03 Jul 2018 00:00:00 +0000

We present new minimax results that concisely capture the relative benefits of source and target labeled data, under {covariate-shift}. Namely, we show that, in general classification settings, the benefits of target labels are controlled by a \emph{transfer-exponent} $\gamma$ that encodes how \emph{singular} $Q$ is locally w.r.t. $P$, and interestingly allows situations where transfer did not seem possible under previous insights. In fact, our new minimax analysis – in terms of $\gamma$ – reveals a \emph{continuum of regimes} ranging from situations where target labels have little benefit, to regimes where target labels dramatically improve classification. We then show that a recently proposed semi-supervised procedure can be extended to adapt to unknown $\gamma$, and therefore requests target labels only when beneficial, while achieving nearly minimax transfer rates.

A Faster Approximation Algorithm for the Gibbs Partition Function

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of estimating the partition function $Z(\beta)=\sum_x \exp(-\beta H(x))$ of a Gibbs distribution with a Hamilton $H(\cdot)$, or more precisely the logarithm of the ratio $q=\ln Z(0)/Z(\beta)$. It has been recently shown how to approximate $q$ with high probability assuming the existence of an oracle that produces samples from the Gibbs distribution for a given parameter value in $[0,\beta]$. The current best known approach due to Huber (2015) uses $O(q\ln n\cdot[\ln q + \ln \ln n+\varepsilon^{-2}])$ oracle calls on average where $\varepsilon$ is the desired accuracy of approximation and $H(\cdot)$ is assumed to lie in $\{0\}\cup[1,n]$. We improve the complexity to $O(q\ln n\cdot\varepsilon^{-2})$ oracle calls. We also show that the same complexity can be achieved if exact oracles are replaced with approximate sampling oracles that are within $O(\frac{\varepsilon^2}{q\ln n})$ variation distance from exact oracles. Finally, we prove a lower bound of $\Omega(q\cdot \varepsilon^{-2})$ oracle calls under a natural model of computation.

Counting Motifs with Graph Sampling

Tue, 03 Jul 2018 00:00:00 +0000

Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where we additionally observe the edges between the sampled vertices and their neighbors. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $G$ with maximal degree $d$, for any connected motif $h$ on $k$ vertices, to estimate the number of copies of $h$ in $G$, denoted by $s=\mathsf{s}(h,G)$, with a multiplicative error of $\epsilon$,

For subgraph sampling, the optimal sampling ratio $p$ is $\Theta_{k}(\max\{ (s\epsilon^2)^{-\frac{1}{k}}, \; \frac{d^{k-1}}{s\epsilon^{2}} \})$, which only depends on the size of the motif but not its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs.
For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieving the sampling ratio $O_{k}(\min\{ (\frac{d}{s\epsilon^2})^{\frac{1}{k-1}}, \; \sqrt{\frac{d^{k-2}}{s\epsilon^2}}\})$, which again only depends on the size of $h$. This is shown to be optimal for all motifs with at most $4$ vertices and cliques of all sizes.

The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.

Efficient Algorithms for Outlier-Robust Regression

Tue, 03 Jul 2018 00:00:00 +0000

We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution ${\mathcal{D}}$ and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to ${\mathcal{D}}$, assuming that the marginal distribution of $\mathcal{D}$ over the input space is \emph{certifiably hypercontractive}. This natural property is satisfied by many well-studied distributions such as Gaussian, strongly log-concave distributions and, uniform distribution on the hypercube among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets.

Information Directed Sampling and Bandits with Heteroscedastic Noise

Tue, 03 Jul 2018 00:00:00 +0000

In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and to satisfy a tail bound uniformly on the domain; a restrictive assumption for many applications. In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret analysis framework, that is similar to the Bayesian framework of Russo and Van Roy (2014), and we prove a new high-probability regret bound for general, possibly randomized policies, which depends on a quantity we refer to as regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize the regret-information ratio over all possible action sampling distributions. This further relies on concentration inequalities for online least squares regression in separable Hilbert spaces, which we generalize to the case of heteroscedastic noise. We then formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding novel algorithms for Bayesian optimization. We also prove frequentist regret bounds, which in the homoscedastic case recover known bounds for UCB, but can be much better when the noise is heteroscedastic. Empirically, we demonstrate in a linear setting with heteroscedastic noise, that some of our methods can outperform UCB and Thompson Sampling, while staying competitive when the noise is homoscedastic.

Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression

Tue, 03 Jul 2018 00:00:00 +0000

High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, condition in literature imposed on the Gram matrix that guarantees nice statistical properties for the Lasso estimator. It is hence natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X \Phi^\top \Phi$, where $\Phi \in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, with high probability, satisfies the RE condition. This construction allows incorporating a fixed matrix that has an easily {\em verifiable} condition into the design process, and allows for generation of {\em compressed} design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$.

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Tue, 03 Jul 2018 00:00:00 +0000

Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern \emph{nonconvex} applications, including training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov’s AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, matching the best known convergence rate, which is faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in general nonconvex setting—all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called \emph{improve or localize}, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon

Tue, 03 Jul 2018 00:00:00 +0000

In reinforcement learning (RL), problems with long planning horizons are perceived as very challenging. The recent advances in PAC RL, however, show that the sample complexity of RL does not depend on planning horizon except at a superficial level. How can we explain such a difference? Noting that the technical assumptions in these upper bounds might have hidden away the challenges of long horizons, we ask the question: \emph{can we prove a lower bound with a horizon dependence when such assumptions are removed?} We also provide a few observations on the desired characteristics of the lower bound construction.

The Vertex Sample Complexity of Free Energy is Polynomial

Tue, 03 Jul 2018 00:00:00 +0000

The free energy is a key quantity which is associated to Markov random fields. Classical results in statistical physics show how, given an analytic formula of the free energy, it is possible to compute many key quantities associated with Markov random fields including quantities such as magnetization and the location of various phase transitions. Given a massive Markov random field on $n$ nodes, can a small sample from it provide a rough approximation to the free energy $\mathcal{F}_n = \log{Z_n}$? Results in the graph limit literature by Borgs, Chayes, Lov{á}sz, S{ó}s, and Vesztergombi show that for Ising models on $n$ nodes and interactions of strength $\Theta(1/n)$, an $\epsilon$ approximation to $\log Z_n / n$ can be achieved by sampling a randomly induced model on $2^{O(1/\epsilon^2)}$ nodes. We show that the sampling complexity of this problem is {\em polynomial in }$1/\epsilon$. We further show a polynomial dependence on $\epsilon$ cannot be avoided. Our results are very general as they apply to higher order Markov random fields. For Markov random fields of order $r$, we obtain an algorithm that achieves $\epsilon$ approximation using a number of samples polynomial in $r$ and $1/\epsilon$ and running time that is $2^{O(1/\epsilon^2)}$ up to polynomial factors in $r$ and $\epsilon$. For ferromagnetic Ising models, the running time is polynomial in $1/\epsilon$. Our results are intimately connected to recent research on the regularity lemma and property testing, where the interest is in finding which properties can tested within $\epsilon$ error in time polynomial in $1/\epsilon$. In particular, our proofs build on results of Alon, de la Vega, Kannan and Karpinski, who also introduced the notion of polynomial vertex sample complexity. Another critical ingredient of the proof is an effective bound by the authors of this paper relating the variational free energy and the free energy.

The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity

Tue, 03 Jul 2018 00:00:00 +0000

The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by Borgs, Chayes, Lov{á}sz, S{ó}s, and Vesztergombi and recent works by Basak and Mukherjee, and Eldan. Our bound is tight up to lower order terms. Building on the methods used to prove the bound, along with techniques from combinatorics and optimization, we study the algorithmic problem of estimating the (variational) free energy for Ising models and general Markov random fields. For a graph $G$ on $n$ vertices and interaction matrix $J$ with Frobenius norm $\|{J} \|_F$, we provide algorithms that approximate the free energy within an additive error of $\epsilon n \|J\|_F$ in time $\exp(poly(1/\epsilon))$. We also show that approximation within $(n \|J\|_F)^{1-\delta}$ is NP-hard for every $\delta > 0$. Finally, we provide more efficient approximation algorithms, which find the optimal mean field approximation, for ferromagnetic Ising models and for Ising models satisfying Dobrushin’s condition.

Accelerating Stochastic Gradient Descent for Least Squares Regression

Tue, 03 Jul 2018 00:00:00 +0000

There is widespread sentiment that fast gradient methods (e.g. Nesterov’s acceleration, conjugate gradient, heavy ball) are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes this conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.

Approximate Nearest Neighbors in Limited Space

Tue, 03 Jul 2018 00:00:00 +0000

We consider the $(1+\epsilon)$-approximate nearest neighbor search problem: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $y$, finds a point $x \in X$ whose distance to $y$ is at most $(1+\epsilon) \min_{x \in X} \|x-y\|$ for an accuracy parameter $\epsilon \in (0,1)$. Our main result is a data structure that occupies only $O(\epsilon^{-2} n \log(n) \log(1/\epsilon))$ bits of space, assuming all point coordinates are integers in the range $\{-n^{O(1)} \ldots n^{O(1)}\}$, i.e., the coordinates have $O(\log n)$ bits of precision. This improves over the best previously known space bound of $O(\epsilon^{-2} n \log(n)^2)$, obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points $X$, and provide almost tight upper and lower bounds for the space complexity of this problem.

Subpolynomial trace reconstruction for random strings \{and arbitrary deletion probability

Tue, 03 Jul 2018 00:00:00 +0000

The deletion-insertion channel takes as input a bit string ${\bf x}\in \{0,1\}^{n}$, and outputs a string where bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $\bf x$ from many independent outputs (called “traces”) of the deletion-insertion channel applied to $\bf x$. We show that if $\bf x$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $\bf x$ with high probability. For the deletion channel with deletion probability $q<1/2$ the earlier upper bound was $\exp(O(\log^{1/2} n))$. The case of $q\geq 1/2$ or the case where insertions are allowed has not been previously analysed, and therefore the earlier upper bound was as for worst-case strings, i.e., $\exp(O( n^{1/3}))$. A key ingredient in our proof is a delicate two-step alignment procedure where we estimate the location in each trace corresponding to a given bit of $\bf x$. The alignment is done by viewing the strings as random walks, and comparing the increments in the walk associated with the input string and the trace, respectively.

The Many Faces of Exponential Weights in Online Learning

Tue, 03 Jul 2018 00:00:00 +0000

A standard introduction to online learning might place Online Gradient Descent at its center and then proceed to develop generalizations and extensions like Online Mirror Descent and second-order methods. Here we explore the alternative approach of putting Exponential Weights (EW) first. We show that many standard methods and their regret bounds then follow as a special case by plugging in suitable surrogate losses and playing the EW posterior mean. For instance, we easily recover Online Gradient Descent by using EW with a Gaussian prior on linearized losses, and, more generally, all instances of Online Mirror Descent based on regular Bregman divergences also correspond to EW with a prior that depends on the mirror map. Furthermore, appropriate quadratic surrogate losses naturally give rise to Online Gradient Descent for strongly convex losses and to Online Newton Step. We further interpret several recent adaptive methods (iProd, Squint, and a variation of Coin Betting for experts) as a series of closely related reductions to exp-concave surrogate losses that are then handled by Exponential Weights. Finally, a benefit of our EW interpretation is that it opens up the possibility of sampling from the EW posterior distribution instead of playing the mean. As already observed by Bubeck and Eldan, this recovers the best-known rate in Online Bandit Linear Optimization.

Cutting plane methods can be extended into nonconvex optimization

Tue, 03 Jul 2018 00:00:00 +0000

We show that it is possible to obtain an $O(\epsilon^{-4/3})$ runtime — including computational cost — for finding $\epsilon$-stationary points of nonconvex functions using cutting plane methods. This improves on the best known epsilon dependence achieved by cubic regularized Newton of $O(\epsilon^{-3/2})$ as proved by Nesterov and Polyak (2006). Our techniques utilize the convex until proven guilty principle proposed by Carmon, Duchi, Hinder, and Sidford (2017).

Open problem: Improper learning of mixtures of Gaussians

Tue, 03 Jul 2018 00:00:00 +0000

We ask whether there exists an efficient unsupervised learning algorithm for mixture of Gaussians in the over-complete case (number of mixtures is larger than the dimension). The notion of learning is taken to be worst-case compression-based, to allow for improper learning.

Actively Avoiding Nonsense in Generative Models

Tue, 03 Jul 2018 00:00:00 +0000

A generative model may generate utter nonsense when it is fit to maximize the likelihood of observed data. This happens due to “model error,” i.e., when the true data generating distribution does not fit within the class of generative models being learned. To address this, we propose a model of active distribution learning using a binary invalidity oracle that identifies some examples as clearly invalid, together with random positive examples sampled from the true distribution. The goal is to maximize the likelihood of the positive examples subject to the constraint of (almost) never generating examples labeled invalid by the oracle. Guarantees are agnostic compared to a class of probability distributions. We first show that proper learning may require exponentially many queries to the invalidity oracle. We then give an improper distribution learning algorithm that uses only polynomially many queries.

Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Tue, 03 Jul 2018 00:00:00 +0000

We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another in which the task is to invert a generative neural network given only compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and a randomness assumption on the network weights, that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. Hence, there are no local minima, saddle points, or other stationary points outside these neighborhoods. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of enforcing deep generative priors and a rigorous understanding of non-linear inverse problems.

Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance

Tue, 03 Jul 2018 00:00:00 +0000

We present \emph{Local Moment Matching (LMM)}, a unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. We construct an efficiently computable estimator that achieves the minimax rates in estimating the distribution up to permutation, and show that the plug-in approach of our unlabeled distribution estimator is “universal" in estimating symmetric functionals of discrete distributions. Instead of doing best polynomial approximation explicitly as in existing literature of functional estimation, the plug-in approach conducts polynomial approximation implicitly and attains the optimal sample complexity for the entropy, power sum and support size functionals.

Geometric Lower Bounds for Distributed Parameter Estimation under Communication Constraints

Tue, 03 Jul 2018 00:00:00 +0000

We consider parameter estimation in distributed networks, where each sensor in the network observes an independent sample from an underlying distribution and has $k$ bits to communicate its sample to a centralized processor which computes an estimate of a desired parameter. We develop lower bounds for the minimax risk of estimating the underlying parameter under squared $\ell_2$ loss for a large class of distributions. Our results show that under mild regularity conditions, the communication constraint reduces the effective sample size by a factor of $d$ when $k$ is small, where $d$ is the dimension of the estimated parameter. Furthermore, this penalty reduces at most exponentially with increasing $k$, which is the case for some models, e.g., estimating high-dimensional distributions. For other models however, we show that the sample size reduction is re-mediated only linearly with increasing $k$, e.g. when some sub-Gaussian structure is available. We apply our results to the distributed setting with product Bernoulli model, multinomial model, and dense/sparse Gaussian location models which recover or strengthen existing results. Our approach significantly deviates from existing approaches for developing information-theoretic lower bounds for communication-efficient estimation. We circumvent the need for strong data processing inequalities used in prior work and develop a geometric approach which builds on a new representation of the communication constraint. This approach allows us to strengthen and generalize existing results with simpler and more transparent proofs.

Certified Computation from Unreliable Datasets

Tue, 03 Jul 2018 00:00:00 +0000

A wide range of learning tasks require human input in labeling massive data. The collected data though are usually low quality and contain inaccuracies and errors. As a result, modern science and business face the problem of learning from unreliable data sets. In this work, we provide a generic approach that is based on \textit{verification} of only few records of the data set to guarantee high quality learning outcomes for various optimization objectives. Our method, identifies small sets of critical records and verifies their validity. We show that many problems only need $\text{poly}(1/\varepsilon)$ verifications, to ensure that the output of the computation is at most a factor of $(1 \pm \varepsilon)$ away from the truth. For any given instance, we provide an \textit{instance optimal} solution that verifies the minimum possible number of records to approximately certify correctness. Then using this instance optimal formulation of the problem we prove our main result: “every function that satisfies some Lipschitz continuity condition can be certified with a small number of verifications”. We show that the required Lipschitz continuity condition is satisfied even by some NP-complete problems, which illustrates the generality and importance of this theorem. In case this certification step fails, an invalid record will be identified. Removing these records and repeating until success, guarantees that the result will be accurate and will depend only on the verified records. Surprisingly, as we show, for several computation tasks more efficient methods are possible. These methods always guarantee that the produced result is not affected by the invalid records, since any invalid record that affects the output will be detected and verified.

Size-Independent Sample Complexity of Neural Networks

Tue, 03 Jul 2018 00:00:00 +0000

We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

Online Learning: Sufficient Statistics and the Burkholder Method

Tue, 03 Jul 2018 00:00:00 +0000

We uncover a fairly general principle in online learning: If a regret inequality can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that 1) can be used algorithmically to achieve the regret bound and 2) only depends on these sufficient statistics, not the entire data sequence, so that the online strategy is only required to keep the sufficient statistics in memory. This characterization is achieved by bringing the full power of the Burkholder Method—originally developed for certifying probabilistic martingale inequalities—to bear on the online learning setting. To demonstrate the scope and effectiveness of the Burkholder method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We also present a linear-time/space prediction strategy for parameter-free supervised learning with linear classes and general smooth norms.

Logistic Regression: The Importance of Being Improper

Tue, 03 Jul 2018 00:00:00 +0000

Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We also show that the improved dependence on predictor norm is near-optimal. Leveraging this improved dependency on the predictor norm yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $\tilde{O}(\sqrt{n})$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings.

Calibrating Noise to Variance in Adaptive Data Analysis

Tue, 03 Jul 2018 00:00:00 +0000

Datasets are often used multiple times and each successive analysis may depend on the outcome of previous analyses. Standard techniques for ensuring generalization and statistical validity do not account for this adaptive dependence. A recent line of work studies the challenges that arise from such adaptive data reuse by considering the problem of answering a sequence of “queries” about the data distribution where each query may depend arbitrarily on answers to previous queries. The strongest results obtained for this problem rely on differential privacy – a strong notion of algorithmic stability with the important property that it “composes” well when data is reused. However the notion is rather strict, as it requires stability under replacement of an arbitrary data element. The simplest algorithm is to add Gaussian (or Laplace) noise to distort the empirical answers. However, analysing this technique using differential privacy yields suboptimal accuracy guarantees when the queries have low variance. Here we propose a relaxed notion of stability based on KL divergence that also composes adaptively. We show that our notion of stability implies a bound on the mutual information between the dataset and the output of the algorithm and then derive new generalization guarantees implied by bounded mutual information. We demonstrate that a simple and natural algorithm based on adding noise scaled to the standard deviation of the query provides our notion of stability. This implies an algorithm that can answer statistical queries about the dataset with substantially improved accuracy guarantees for low-variance queries. The only previous approach that provides such accuracy guarantees is based on a more involved differentially private median-of-means algorithm and its analysis exploits stronger “group” stability of the algorithm.

Hidden Integrality of SDP Relaxations for Sub-Gaussian Mixture Models

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of finding discrete clustering structures under Sub-Gaussian Mixture Models. We establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded by the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models. A special case of this result shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our result establishes sufficient conditions for the SDP to correctly recover the cluster memberships of at least $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. Error bounds for estimating cluster centers can also be derived directly from our results.

Fitting a Putative Manifold to Noisy Data

Tue, 03 Jul 2018 00:00:00 +0000

In the present work, we give a solution to the following question from manifold learning. Suppose data belonging to a high dimensional Euclidean space is drawn independently, identically distributed from a measure supported on a low dimensional twice differentiable embedded manifold $M$, and corrupted by a small amount of gaussian noise. How can we produce a manifold $M’$ whose Hausdorff distance to $M$ is small and whose reach is not much smaller than the reach of $M$?

Detection limits in the high-dimensional spiked rectangular model

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of detecting the presence of a single unknown spike in a rectangular data matrix, in a high-dimensional regime where the spike has fixed strength and the aspect ratio of the matrix converges to a finite limit. This setup includes Johnstone’s spiked covariance model. We analyze the likelihood ratio of the spiked model against an “all noise" null model of reference, and show it has asymptotically Gaussian fluctuations in a region below—but in general not up to—the so-called BBP threshold from random matrix theory. Our result parallels earlier findings of Onatski et al. (2013) and Johnstone-Onatski (2015) for spherical spikes. We present a probabilistic approach capable of treating generic product priors. In particular, sparsity in the spike is allowed. Our approach operates through the principle of the cavity method from spin-glass theory. The question of the maximal parameter region where asymptotic normality is expected to hold is left open. This region, not necessarily given by BBP, is shaped by the prior in a non-trivial way. We conjecture that this is the entire paramagnetic phase of an associated spin-glass model, and is defined by the vanishing of the replica-symmetric solution of Lesieur et al. (2015).

Privacy-preserving Prediction

Tue, 03 Jul 2018 00:00:00 +0000

Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all differentially private prediction algorithms.

Log-concave sampling: Metropolis-Hastings algorithms are fast!

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds reveal that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} \big(\kappa d \log(1/\delta) \big)$ steps, as compared to the $\mathcal{O} \big(\kappa^2 d/\delta^2 \big)$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes $\mathcal{O}(\kappa d)$ slower than MALA.

$\ell_1$ Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent

Tue, 03 Jul 2018 00:00:00 +0000

We present preconditioned stochastic gradient descent (SGD) algorithms for the $\ell_1$ minimization problem $\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}} \boldsymbol{\mathit{x}} - \boldsymbol{\mathit{b}}\|_1$ in the overdetermined case, where there are far more constraints than variables. Specifically, we have $\boldsymbol{\mathit{A}} \in \mathbb{R}^{n \times d}$ for $n \gg d$. Commonly known as the Least Absolute Deviations problem, $\ell_1$ regression can be used to solve many important combinatorial problems, such as minimum cut and shortest path. SGD-based algorithms are appealing for their simplicity and practical efficiency. Our primary insight is that careful preprocessing can yield preconditioned matrices $\tilde{\boldsymbol{\mathit{A}}}$ with strong properties (besides good condition number and low-dimension) that allow for faster convergence of gradient descent. In particular, we precondition using Lewis weights to obtain an isotropic matrix with fewer rows and strong upper bounds on all row norms. We leverage these conditions to find a good initialization, which we use along with recent smoothing reductions and accelerated stochastic gradient descent algorithms to achieve $\epsilon$ relative error in $\widetilde{O}(nnz(\boldsymbol{\mathit{A}}) + d^{2.5} \epsilon^{-2})$ time with high probability, where $nnz(\boldsymbol{\mathit{A}})$ is the number of non-zeros in $\boldsymbol{\mathit{A}}$. This improves over the previous best result using gradient descent for $\ell_1$ regression. We also match the best known running times for interior point methods in several settings. Finally, we also show that if our original matrix $\boldsymbol{\mathit{A}}$ is approximately isotropic and the row norms are approximately equal, we can give an algorithm that avoids using fast matrix multiplication and obtains a running time of $\widetilde{O}(nnz(\boldsymbol{\mathit{A}}) + s d^{1.5}\epsilon^{-2} + d^2\epsilon^{-2})$, where $s$ is the maximum number of non-zeros in a row of $\boldsymbol{\mathit{A}}$. In this setting, we beat the best interior point methods for certain parameter regimes.

Learning Single-Index Models in Gaussian Space

Tue, 03 Jul 2018 00:00:00 +0000

We consider regression problems where the response is a smooth but non-linear function of a $k$-dimensional projection of $p$ normally-distributed covariates, contaminated with additive Gaussian noise. The goal is to recover the range of the $k$-dimensional projection, i.e., the index space. This model is called the multi-index model, and the $k=1$ case is called the single-index model. For the single-index model, we characterize the population landscape of a natural semi-parametric maximum likelihood objective in terms of the link function and prove that it has no spurious local minima. We also propose and analyze an efficient iterative procedure that recovers the index space up to error $\epsilon$ using a sample size $\tilde{O}(p^{O(R^2/\mu)} + p/\epsilon^2)$, where $R$ and $\mu$, respectively, parameterize the smoothness of the link function and the signal strength. When a multi-index model is incorrectly specified as a single-index model, we prove that essentially the same procedure, with sample size $\tilde{O}(p^{O(kR^2/\mu)} + p/\epsilon^2)$, returns a vector that is $\epsilon$-close to being completely in the index space.

Minimax Bounds on Stochastic Batched Convex Optimization

Tue, 03 Jul 2018 00:00:00 +0000

We study the stochastic batched convex optimization problem, in which we use many \emph{parallel} observations to optimize a convex function given limited rounds of interaction. In each of $M$ rounds, an algorithm may query for information at $n$ points, and after issuing all $n$ queries, it receives unbiased noisy function and/or (sub)gradient evaluations at the $n$ points. After $M$ such rounds, the algorithm must output an estimator. We provide lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for Lipschitz convex and smooth strongly convex functions. Our rates of convergence (nearly) achieve the fully sequential rate once $M = O(d \log \log n)$, where $d$ is the problem dimension, but the rates may exponentially degrade as the dimension $d$ increases, in distinction from fully sequential settings.

Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of robustly learning multi-dimensional histograms. A $d$-dimensional function $h: D \to \R$ is called a $k$-histogram if there exists a partition of the domain $D \subseteq \R^d$ into $k$ axis-aligned rectangles such that $h$ is constant within each such rectangle. Let $f: D \to \R$ be a $d$-dimensional probability density function and suppose that $f$ is $\mathrm{OPT}$-close, in $L_1$-distance, to an unknown $k$-histogram (with unknown partition). Our goal is to output a hypothesis that is $O(\mathrm{OPT}) + \epsilon$ close to $f$, in $L_1$-distance. We give an algorithm for this learning problem that uses $n = \tilde{O}_d(k/\eps^2)$ samples and runs in time $\tilde{O}_d(n)$. For any fixed dimension, our algorithm has optimal sample complexity, up to logarithmic factors, and runs in near-linear time. Prior to our work, the time complexity of the $d=1$ case was well-understood, but significant gaps in our understanding remained even for $d=2$.

Testing Symmetric Markov Chains From a Single Trajectory

Tue, 03 Jul 2018 00:00:00 +0000

The paper’s abstract in valid LaTeX, without non-standard macros or \cite commands. Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a {\em single trajectory of a Markov Chain.} In particular, we observe a single trajectory $X_0,\ldots,X_t,\ldots$ of an unknown, symmetric, and finite state Markov Chain $\cal M$. We do not control the starting state $X_0$, and we cannot restart the chain. Given our single trajectory, the goal is to test whether $\cal M$ is identical to a model Markov Chain ${\cal M}’$, or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain ${\cal M}’$ is $\tilde{O}(n)$ in the size of the state space $n$.

Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

Tue, 03 Jul 2018 00:00:00 +0000

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as “lock-in probability”. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.

Detecting Correlations with Little Memory and Communication

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise correlations with optimal sample complexity, the number of required memory/communication bits is at least quadratic in the dimension. Our results substantially improve those of Shamir (2014), which studied a similar question in a much more restricted setting. To the best of our knowledge, these are the first provable sample/memory/communication trade-offs for a practical estimation problem, using standard distributions, and in the natural regime where the memory/communication budget is larger than the size of a single data point. To derive our theorems, we prove a new information-theoretic result, which may be relevant for studying other information-constrained learning problems.

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces

Tue, 03 Jul 2018 00:00:00 +0000

We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms.

Exact and Robust Conformal Inference Methods for Predictive Machine Learning with Dependent Data

Tue, 03 Jul 2018 00:00:00 +0000

We extend conformal inference to general settings that allow for time series data. Our proposal is developed as a randomization method and accounts for potential serial dependence by including block structures in the permutation scheme. As a result, the proposed method retains the exact, model-free validity when the data are i.i.d. or more generally exchangeable, similar to usual conformal inference methods. When exchangeability fails, as is the case for common time series data, the proposed approach is approximately valid under weak assumptions on the conformity score.

Non-Convex Matrix Completion Against a Semi-Random Adversary

Tue, 03 Jul 2018 00:00:00 +0000

Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies crucially on the assumption that each entry of the matrix is observed with exactly the same probability $p$, which is not realistic in practice. In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is {\em at least} $p$. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima. In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input, so that it becomes “similar” to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix.

Underdamped Langevin MCMC: A non-asymptotic analysis

Tue, 03 Jul 2018 00:00:00 +0000

We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $\mathcal{O}(d/\varepsilon^2)$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.

Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent

Tue, 03 Jul 2018 00:00:00 +0000

We study \emph{smoothed online convex optimization}, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given a $\Omega(\sqrt{d})$ lower bound on the competitive ratio of any online algorithm, where $d$ is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $3 + O(1/\alpha)$, for locally polyhedral costs, where $\alpha$ measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret.

Incentivizing Exploration by Heterogeneous Users

Tue, 03 Jul 2018 00:00:00 +0000

We consider the problem of incentivizing exploration with heterogeneous agents. In this problem, $N$ bandit arms provide vector-valued outcomes equal to an unknown arm-specific attribute vector, perturbed by independent noise.Agents arrive sequentially and choose arms to pull based on their own private and heterogeneous linear utility functions over attributes and the estimates of the arms’ attribute vectors derived from observations of other agents’ past pulls. Agents are myopic and selfish and thus would choose the arm with maximum estimated utility. A principal, knowing only the distribution from which agents’ preferences are drawn, but not the specific draws, can offer arm-specific incentive payments to encourage agents to explore underplayed arms. The principal seeks to minimize the total expected cumulative regret incurred by agents relative to their best arms, while also making a small expected cumulative payment. We propose an algorithm that incentivizes arms played infrequently in the past whose probability of being played in the next round would be small without incentives. Under the assumption that each arm is preferred by at least a fraction $p > 0$ of agents, we show that this algorithm achieves expected cumulative regret of $O (N \e^{2/p} + N \log^3(T))$, using expected cumulative payments of $O(N^2 \e^{2/p})$. If $p$ is known or the distribution over agent preferences is discrete, the exponential term $\e^{2/p}$ can be replaced with suitable polynomials in $N$ and $1/p$. For discrete preferences, the regret’s dependence on $T$ can be eliminated entirely, giving constant (depending only polynomially on $N$ and $1/p$) expected regret and payments. This constant regret stands in contrast to the $\Theta(\log(T))$ dependence of regret in standard multi-armed bandit problems. It arises because even unobserved heterogeneity in agent preferences causes exploitation of arms to also explore arms fully; succinctly, heterogeneity provides free exploration.

Nonstochastic Bandits with Composite Anonymous Feedback

Tue, 03 Jul 2018 00:00:00 +0000

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.

A General Approach to Multi-Armed Bandits Under Risk Criteria

Tue, 03 Jul 2018 00:00:00 +0000

Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.

Near-Optimal Sample Complexity Bounds for Maximum Likelihood Estimation of Multivariate Log-concave Densities

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of learning multivariate log-concave densities with respect to a global loss function. We obtain the first upper bound on the sample complexity of the maximum likelihood estimator (MLE) for a log-concave density on $\mathbb{R}^d$, for all $d \geq 4$. Prior to this work, no finite sample upper bound was known for this estimator in more than $3$ dimensions. In more detail, we prove that for any $d \geq 1$ and $\epsilon>0$, given $\tilde{O}_d((1/\epsilon)^{(d+3)/2})$ samples drawn from an unknown log-concave density $f_0$ on $\mathbb{R}^d$, the MLE outputs a hypothesis $h$ that with high probability is $\epsilon$-close to $f_0$, in squared Hellinger loss. A sample complexity lower bound of $\Omega_d((1/\epsilon)^{(d+1)/2})$ was previously known for any learning algorithm that achieves this guarantee. We thus establish that the sample complexity of the log-concave MLE is near-optimal, up to an $\tilde{O}(1/\epsilon)$ factor.

Action-Constrained Markov Decision Processes With Kullback-Leibler Cost

Tue, 03 Jul 2018 00:00:00 +0000

This paper concerns computation of optimal policies in which the one-step reward function contains a cost term that models Kullback-Leibler divergence with respect to nominal dynamics. This technique was introduced by Todorov in 2007, where it was shown under general conditions that the solution to the average-reward optimality equations reduce to a simple eigenvector problem. Since then many authors have sought to apply this technique to control problems and models of bounded rationality in economics. A crucial assumption is that the input process is essentially unconstrained. For example, if the nominal dynamics include randomness from nature (e.g., the impact of wind on a moving vehicle), then the optimal control solution does not respect the exogenous nature of this disturbance. This paper introduces a technique to solve a more general class of action-constrained MDPs. The main idea is to solve an entire parameterized family of MDPs, in which the parameter is a scalar weighting the one-step reward function. The approach is new and practical even in the original unconstrained formulation.

Conference on Learning Theory 2018: Preface

Tue, 03 Jul 2018 00:00:00 +0000

Preface to the proceedings of the 31st Conference On Learning Theory.

Optimal Single Sample Tests for Structured versus Unstructured Network Data

Tue, 03 Jul 2018 00:00:00 +0000

We study the problem of testing, using only a single sample, between mean field distribu- tions (like Curie-Weiss, Erdős-Rényi) and structured Gibbs distributions (like Ising model on sparse graphs and Exponential Random Graphs). Our goal is to test without know- ing the parameter values of the underlying models: only the structure of dependencies is known. We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test. The test can dis- tinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses. The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures. The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction changes the behavior of the functions under consideration, making it hard to directly apply standard methods (i.e., Stein’s method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.

Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure

Tue, 03 Jul 2018 00:00:00 +0000

Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A line of work initiated in Berthet and Rigollet (2013) has aimed to explain these gaps through reductions to conjecturally hard problems from complexity theory. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:

Planted Independent Set: We show tight lower bounds for detecting a planted independent set of size $k$ in a sparse Erdős-Rényi graph of size $n$ with edge density $\tilde{\Theta}(n^{-\alpha})$.
Planted Dense Subgraph: If $p > q$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $q = \tilde{\Theta}(n^{-\alpha})$ and $p - q = \tilde{\Theta}(n^{-\gamma})$ where $\gamma \ge \alpha$, matching the lower bounds predicted in Chen and Xu (2016). Our lower bounds apply to a deterministic community size $k$, resolving a question raised in Hajek et al. (2015).
Biclustering: We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly at random planted flat $k \times k$ submatrix.
Sparse Rank-1 Submatrix: We show that detection in the sparse spiked Wigner model is often harder than biclustering, and are able to obtain two different tight lower bounds for these problems with different reductions from planted clique.
Sparse PCA: We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $k \gg \sqrt{n}$, when the spectral algorithm is optimal over the SDP. We give an alternate reduction recovering the lower bounds of Berthet and Rigollet (2013) and Gao et al. (2017) in the simple hypothesis testing variant of sparse PCA. We also observe a subtlety in the complexity of sparse PCA that arises when the planted vector is biased.
Subgraph Stochastic Block Model: We introduce a model where two small communities are planted in an Erdős-Rényi graph of the same average edge density and give tight lower bounds yielding different hard regimes than planted dense subgraph.

Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case in worst-case complexity. Our main technical contribution is to introduce a set of techniques for average-case reductions that: (1) maintain the level of signal in an instance of a problem; (2) alter its planted structure; and (3) map two initial high-dimensional distributions simultaneously to two target distributions approximately under total variation. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we consider.

Online Variance Reduction for Stochastic Optimization

Tue, 03 Jul 2018 00:00:00 +0000

Modern stochastic optimization methods often rely on uniform sampling which is agnostic to the underlying characteristics of the data. This might degrade the convergence by yielding estimates that suffer from a high variance. A possible remedy is to employ non-uniform \emph{importance sampling} techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting which poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling data points, it naturally extends to selecting coordinates or even blocks of thereof. Empirical validations underline the benefits of our method in several settings.

Active Tolerant Testing

Tue, 03 Jul 2018 00:00:00 +0000

In this work, we show that for a nontrivial hypothesis class $\mathcal C$, we can estimate the distance of a target function $f$ to $\mathcal C$ (estimate the error rate of the best $h\in \mathcal C$) using substantially fewer labeled examples than would be needed to actually {\em learn} a good $h \in \mathcal C$. Specifically, we show that for the class $\mathcal C$ of unions of $d$ intervals on the line, in the active learning setting in which we have access to a pool of unlabeled examples drawn from an arbitrary underlying distribution $\mathcal D$, we can estimate the error rate of the best $h \in \mathcal C$ to an additive error $\epsilon$ with a number of label requests that is {\em independent of $d$} and depends only on $\epsilon$. In particular, we make $O(\frac{1}{\epsilon^6}\log \frac{1}{\epsilon})$ label queries to an unlabeled pool of size $O(\frac{d}{\epsilon^2}\log \frac{1}{\epsilon})$. This task of estimating the distance of an unknown $f$ to a given class $\mathcal C$ is called {\em tolerant testing} or {\em distance estimation} in the testing literature, usually studied in a membership query model and with respect to the uniform distribution. Our work extends that of Balcan et al. (2012) who solved the {\em non}-tolerant testing problem for this class (distinguishing the zero-error case from the case that the best hypothesis in the class has error greater than $\epsilon$). We also consider the related problem of estimating the performance of a given learning algorithm $\mathcal A$ in this setting. That is, given a large pool of unlabeled examples drawn from distribution $\mathcal D$, can we, from only a few label queries, estimate how well $\mathcal A$ would perform if the entire dataset were labeled and given as training data to $\mathcal A$? We focus on $k$-Nearest Neighbor style algorithms, and also show how our results can be applied to the problem of hyperparameter tuning (selecting the best value of $k$ for the given learning problem).

Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form

Tue, 03 Jul 2018 00:00:00 +0000

Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer–Monteiro factorization approach for solving SDPs. For a large class of SDPs, upon random perturbation of the cost matrix, with high probability, we show that all approximate second-order stationary points are approximate global optima for the penalty formulation of appropriately rank-constrained SDPs, as long as the number of constraints scales sub-quadratically with the desired rank. Our result is based on a simple penalty function formulation of the rank-constrained SDP along with a smoothed analysis to avoid worst-case cost matrices. We particularize our results to two applications, namely, Max-Cut and matrix completion.

Hardness of Learning Noisy Halfspaces using Polynomial Thresholds

Tue, 03 Jul 2018 00:00:00 +0000

We prove the hardness of weakly learning halfspaces in the presence of adversarial noise using polynomial threshold functions (PTFs). In particular, we prove that for any constants $d \in \mathbb{Z}^+$ and $\eps > 0$, it is NP-hard to decide: given a set of $\{-1,1\}$-labeled points in $\mathbb{R}^n$ whether (YES Case) there exists a halfspace that classifies $(1-\eps)$-fraction of the points correctly, or (NO Case) any degree-$d$ PTF classifies at most $(1/2 + \eps)$-fraction of the points correctly. This strengthens to all constant degrees the previous NP-hardness of learning using degree-$2$ PTFs shown by Diakonikolas et al. (2011). The latter result had remained the only progress over the works of Feldman et al. (2006) and Guruswami et al. (2006) ruling out weakly proper learning adversarially noisy halfspaces.

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

Tue, 03 Jul 2018 00:00:00 +0000

Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a \emph{simple and explicit finite time analysis} of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. A final section of the paper shows that all of our main results extend to the study of a variant of Q-learning applied to optimal stopping problems.

Langevin Monte Carlo and JKO splitting

Tue, 03 Jul 2018 00:00:00 +0000

Algorithms based on discretizing Langevin diffusion are popular tools for sampling from high-dimensional distributions. We develop novel connections between such Monte Carlo algorithms, the theory of Wasserstein gradient flow, and the operator splitting approach to solving PDEs. In particular, we show that a proximal version of the Unadjusted Langevin Algorithm corresponds to a scheme that alternates between solving the gradient flows of two specific functionals on the space of probability measures. Using this perspective, we derive some new non-asymptotic results on the convergence properties of this algorithm.

Approximation beats concentration? An approximation view on inference with smooth radial kernels

Tue, 03 Jul 2018 00:00:00 +0000

Positive definite kernels and their associated Reproducing Kernel Hilbert Spaces provide a mathematically compelling and practically competitive framework for learning from data. In this paper we take the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties. We analyze eigenvalue decay of kernels operators and matrices, properties of eigenfunctions/eigenvectors and “Fourier” coefficients of functions in the kernel space restricted to a discrete set of data points. We also investigate the fitting capacity of kernels, giving explicit bounds on the fat shattering dimension of the balls in Reproducing Kernel Hilbert spaces. Interestingly, the same properties that make kernels very effective approximators for functions in their “native” kernel space, also limit their capacity to represent arbitrary functions. We discuss various implications, including those for gradient descent type methods. It is important to note that most of our bounds are measure independent. Moreover, at least in moderate dimension, the bounds for eigenvalues are much tighter than the bounds which can be obtained from the usual matrix concentration results. For example, we see that eigenvalues of kernel matrices show nearly exponential decay with constants depending only on the kernel and the domain. We call this “approximation beats concentration” phenomenon as even when the data are sampled from a probability distribution, some of their aspects are better understood in terms of approximation theory.

Time-Space Tradeoffs for Learning Finite Functions from Random Evaluations, with Applications to Polynomials

Tue, 03 Jul 2018 00:00:00 +0000

We develop an extension of recent analytic methods for obtaining time-space tradeoff lower bounds for problems of learning from uniformly random labelled examples. With our methods we can obtain bounds for learning concept classes of finite functions from random evaluations even when the sample space of random inputs can be significantly smaller than the concept class of functions and the function values can be from an arbitrary finite set. At the core of our results, we reduce the time-space complexity of learning from random evaluations to the question of how much the corresponding evaluation matrix amplifies the 2-norms of “almost uniform” probability distributions. To analyze the latter, we formulate it as a semidefinite program, and we analyze its dual. In order to handle function values from arbitrary finite sets, we apply this norm amplification analysis to complex matrices. As applications that follow from our new techniques, we show that any algorithm that learns $n$-variate polynomial functions of degree at most $d$ over $\mathbb{F}_2$ with success at least $2^{-O(n)}$ from evaluations on randomly chosen inputs either requires space $\Omega(nm/d)$ or $2^{\Omega(n/d)}$ time where $m=(n/d)^{\Theta(d)}$ is the dimension of the space of such polynomials. These bounds are asymptotically optimal for polynomials of arbitrary constant degree since they match the tradeoffs achieved by natural learning algorithms for the problems. We extend these results to learning polynomials of degree at most $d$ over any odd prime field $\mathbb{F}_p$ where we show that $\Omega((mn/d)\log p)$ space or time $p^{\Omega(n/d)}$ is required. To derive our bounds for learning polynomials over finite fields, we show that an analysis of the dual of the corresponding semidefinite program follows from an understanding of the distribution of the bias of all degree $d$ polynomials with respect to uniformly random inputs.

Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models

Tue, 03 Jul 2018 00:00:00 +0000

Generalized linear models (GLMs) arise in high-dimensional machine learning, statistics, communications and signal processing. % In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes or benchmarks models in neural networks. % We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal inference and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and dimensions are large and their ratio is fixed. % Non-rigorous predictions for the optimal inference and generalization errors existed for special cases of GLMs, e.g. for the perceptron in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. % Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance, and locate the associated sharp phase transitions separating learnable and non-learnable regions.

An Analysis of the t-SNE Algorithm for Data Visualization

Tue, 03 Jul 2018 00:00:00 +0000

A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2-dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The \emph{t-SNE} heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.

Online learning over a finite action set with limited switching

Tue, 03 Jul 2018 00:00:00 +0000

We study the value of switching actions in the Prediction From Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied and practically motivated setting of PFE with switching costs. Many algorithms are known to achieve the minimax optimal order of $O(\sqrt{T \log n})$ in \textit{expectation} for both regret and number of switches, where $T$ is the number of iterations and $n$ the number of actions. However, no \textit{high probability} guarantees are known. Our main technical contribution is the first algorithms which with high probability achieve this optimal order for both regret and number of switches. This settles an open problem of [Devroye et al., 2015], directly implies the first high probability guarantees for several problems of interest, and is efficiently adaptable to the related problem of online combinatorial optimization with limited switching. \par Next, to investigate the value of switching actions at a more granular level, we introduce the setting of \textit{switching budgets}, in which the algorithm is limited to $S \leq T$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches allowed in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both the PFE and MAB problems, for all switching budgets $S \leq T$, and for both expectation and high probability guarantees. For PFE, we show that the optimal rate is of order $\tilde{\Theta}(\sqrt{T\log n})$ for $S = \Omega(\sqrt{T\log n})$, and $\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$ for $S = O(\sqrt{T \log n})$. Interestingly, the bandit setting does not exhibit such a phase transition; instead we show the minimax rate decays steadily as $\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$ for all ranges of $S \leq T$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting.

Unleashing Linear Optimizers for Group-Fair Learning and Optimization

Tue, 03 Jul 2018 00:00:00 +0000

Most systems and learning algorithms optimize average performance or average loss – one reason being computational complexity. However, many objectives of practical interest are more complex than simply average loss. This arises, for example, when balancing performance or loss with fairness across people. We prove that, from a computational perspective, optimizing arbitrary objectives that take into account performance over a small number of groups is not significantly harder to optimize than average performance. Our main result is a polynomial-time reduction that uses a linear optimizer to optimize an arbitrary (Lipschitz continuous) function of performance over a (constant) number of possibly-overlapping groups. This includes fairness objectives over small numbers of groups, and we further point out that other existing notions of fairness such as individual fairness can be cast as convex optimization and hence more standard convex techniques can be used. Beyond learning, our approach applies to multi-objective optimization, more generally.

Lower Bounds for Higher-Order Convex Optimization

Tue, 03 Jul 2018 00:00:00 +0000

State-of-the-art methods in mathematical optimization employ higher-order derivative information. We explore the limitations of higher-order optimization and prove that even for convex optimization, a polynomial dependence on the approximation guarantee and higher-order smoothness parameters is necessary. This refutes the hope that higher-order smoothness and higher-order derivatives can lead to dimension free polynomial time algorithms for convex optimization. As a special case, we show Nesterov’s accelerated cubic regularization method and higher-order methods to be nearly tight.

Faster Rates for Convex-Concave Games

Tue, 03 Jul 2018 00:00:00 +0000

We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $O(T^{-1/2})$, recent work \citep{RS13,SALS15} has established $O(1/T)$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a particular class of games one achieves a $O(1/T^2)$ rate, and we show how this applies to the Frank-Wolfe method and recovers a similar bound \citep{D15}. We also show that such no-regret techniques can even achieve a linear rate, $O(\exp(-T))$, for equilibrium computation under additional curvature assumptions.

Best of both worlds: Stochastic & adversarial best-arm identification

Tue, 03 Jul 2018 00:00:00 +0000

We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: $\backslash$emph{\{}Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards?{\}} First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.