Proceedings of Machine Learning Research

A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics

Sun, 18 Jun 2017 00:00:00 +0000

We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. The algorithm performs stochastic gradient descent, where in each step it injects appropriately scaled Gaussian noise to the update. We analyze the algorithm’s hitting time to an arbitrary subset of the parameter space. Two results follow from our general theory: First, we prove that for empirical risk minimization, if the empirical risk is point-wise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that only exist in the empirical risk. Second, we show that SGLD improves on one of the best known learnability results for learning linear classifiers under the zero-one loss.

Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds

Sun, 18 Jun 2017 00:00:00 +0000

Although there exist plentiful theories of empirical risk minimization (ERM) for supervised learning, current theoretical understandings of ERM for a related problem—stochastic convex optimization (SCO), are limited. In this work, we strengthen the realm of ERM for SCO by exploiting smoothness and strong convexity conditions to improve the risk bounds. First, we establish an $\widetilde{O}(d/n + \sqrt{F}_*/n)$ risk bound when the random function is nonnegative, convex and smooth, and the expected function is Lipschitz continuous, where $d$ is the dimensionality of the problem, $n$ is the number of samples, and $F_*$ is the minimal risk. Thus, when $F_*$ is small we obtain an $\widetilde{O}(d/n)$ risk bound, which is analogous to the $\widetilde{O}(1/n)$ optimistic rate of ERM for supervised learning. Second, if the objective function is also $λ$-strongly convex, we prove an $\widetilde{O}(d/n + κF_*/n )$ risk bound where $κ$ is the condition number, and improve it to $O(1/[λn^2] + κF_*/n)$ when $n=\widetilde{Ω}(κd)$. As a result, we obtain an $O(κ/n^2)$ risk bound under the condition that $n$ is large and $F_*$ is small, which to the best of our knowledge, is the first $O(1/n^2)$-type of risk bound of ERM. Third, we stress that the above results are established in a unified framework, which allows us to derive new risk bounds under weaker conditions, e.g., without convexity of the random function. Finally, we demonstrate that to achieve an $O(1/[λn^2] + κF_*/n)$ risk bound for supervised learning, the $\widetilde{Ω}(κd)$ requirement on $n$ can be replaced with $Ω(κ^2)$, which is dimensionality-independent.

Learning Non-Discriminatory Predictors

Sun, 18 Jun 2017 00:00:00 +0000

We consider learning a predictor which is non-discriminatory with respect to a “protected attribute” according to the notion of “equalized odds” proposed by Hardt et al. (2016). We study the problem of learning such a non-discriminatory predictor from a finite training set, both statistically and computationally. We show that a post-hoc correction approach, as suggested by Hardt et al, can be highly suboptimal, present a nearly-optimal statistical procedure, argue that the associated computational problem is intractable, and suggest a second moment relaxation of the non-discrimination definition for which learning is tractable.

Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch Prox

Sun, 18 Jun 2017 00:00:00 +0000

We present and analyze statistically optimal, communication and memory efficient distributed stochastic optimization algorithms with near-linear speedups (up to $\log$-factors). This improves over prior work which includes methods with near-linear speedups but polynomial communication requirements (accelerated minibatch SGD) and communication efficient methods which do not exhibit any runtime speedups over a naive single-machine approach. We first analyze a distributed SVRG variant as a distributed stochastic optimization method and show that it can achieve near-linear speedups with logarithmic rounds of communication, at the cost of high memory requirements. We then present a novel method, MB-DSVRG, which trades off memory for communication and still allows for optimization with communication which scales only logarithmically with the desired accuracy while also being memory efficient. MB-DSVRG is based on a minibatch prox procedure, solving a non-linearized subproblem on a minibatch at each iteration. We provide a novel analysis for this procedure which achieves the statistical optimal rate regardless of minibatch size and smoothness, and thus significantly improving on prior work.

Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization

Sun, 18 Jun 2017 00:00:00 +0000

We consider learning in the presence of arbitrary noise that can overwhelm the signal in terms of magnitude on a fraction of data points observed (aka outliers). Standard approaches based on minimizing empirical loss can fail miserably and lead to arbitrary bad solutions in this setting. We propose an approach that iterates between finding a solution with minimal empirical loss and re-weighting the data, reinforcing data points where the previous solution works well. We show that our approach can handle arbitrarily large noise, is robust as having a non-trivial breakdown point, and converges linearly under certain conditions. The intuitive idea of our approach is to automatically exclude “difficult” data points from model fitting. More importantly (and perhaps surprisingly), we validate this intuition by establishing guarantees for generalization and iteration complexity that \em essentially ignore the presence of outliers

On Learning vs. Refutation

Sun, 18 Jun 2017 00:00:00 +0000

Building on the work of Daniely et al. (STOC 2014, COLT 2016), we study the connection between computationally efficient PAC learning and refutation of constraint satisfaction problems. Specifically, we prove that for every concept class $\mathcal{P}$, PAC-learning $\mathcal{P}$ is \em polynomially equivalent to “random-right-hand-side-refuting” (“RRHS-refuting”) a dual class $\mathcal{P}^*$, where RRHS-refutation of a class $\mathcal{Q}$ refers to refuting systems of equations where the constraints are (worst-case) functions from the class $\mathcal{Q}$ but the right-hand-sides of the equations are uniform and independent random bits. The reduction from refutation to PAC learning can be viewed as an abstraction of (part of) the work of Daniely, Linial, and Shalev-Schwartz (STOC 2014). The converse, however, is new, and is based on a combination of techniques from pseudorandomness (Yao ‘82) with boosting (Schapire ‘90). In addition, we show that PAC-learning the class of $\mathit{DNF}$ formulas is polynomially equivalent to PAC-learning its dual class $\mathit{DNF}^*$, and thus PAC-learning $\mathit{DNF}$ is equivalent to RRHS-refutation of $\mathit{DNF}$, suggesting an avenue to obtain stronger lower bounds for PAC-learning $\mathit{DNF}$ than the quasipolynomial lower bound that was obtained by Daniely and Shalev-Schwartz (COLT 2016) assuming the hardness of refuting $k$-SAT.

The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime

Sun, 18 Jun 2017 00:00:00 +0000

We propose a novel technique for analyzing adaptive sampling called the Simulator. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques, without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-arm bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity, and the asymptotic sample complexity as the confidence delta tends to zero, as found in the literature. We also prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors. Moreover, our lower bounds zero-in on the number of times each individual arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for the best-arm and top-k identification, the first practical algorithm of its kind for the latter problem which removes extraneous log factors, and outperforms the state-of-the-art in experiments.

An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits

Sun, 18 Jun 2017 00:00:00 +0000

We present a new strategy for gap estimation in randomized algorithms for multiarmed bandits and combine it with the EXP3++ algorithm of Seldin and Slivkins (2014). In the stochastic regime the strategy reduces dependence of regret on a time horizon from $(\ln t)^3$ to $(\ln t)^2$ and eliminates an additive factor of order $∆e^1/∆^2$, where $∆$ is the minimal gap of a problem instance. In the adversarial regime regret guarantee remains unchanged.

Fast and robust tensor decomposition with applications to dictionary learning

Sun, 18 Jun 2017 00:00:00 +0000

We develop fast spectral algorithms for tensor decomposition that match the robustness guarantees of the best known polynomial-time algorithms for this problem based on the sum-of-squares (SOS) semidefinite programming hierarchy. Our algorithms can decompose a 4-tensor with $n$-dimensional orthonormal components in the presence of error with constant spectral norm (when viewed as an $n^2$-by-$n^2$ matrix). The running time is $n^5$ which is close to linear in the input size $n^4$. We also obtain algorithms with similar running time to learn sparsely-used orthogonal dictionaries even when feature representations have constant relative sparsity and non-independent coordinates. The only previous polynomial-time algorithms to solve these problem are based on solving large semidefinite programs. In contrast, our algorithms are easy to implement directly and are based on spectral projections and tensor-mode rearrangements. Or work is inspired by recent of Hopkins, Schramm, Shi, and Steurer (STOC’16) that shows how fast spectral algorithms can achieve the guarantees of SOS for average-case problems. In this work, we introduce general techniques to capture the guarantees of SOS for worst-case problems.

Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization

Sun, 18 Jun 2017 00:00:00 +0000

In this paper, we consider the problem of sequentially optimizing a black-box function $f$ based on noisy samples and bandit feedback. We assume that $f$ is smooth in the sense of having a bounded norm in some reproducing kernel Hilbert space (RKHS), yielding a commonly-considered non-Bayesian form of Gaussian process bandit optimization. We provide algorithm-independent lower bounds on the simple regret, measuring the suboptimality of a single point reported after $T$ rounds, and on the cumulative regret, measuring the sum of regrets over the $T$ chosen points. For the isotropic squared-exponential kernel in $d$ dimensions, we find that an average simple regret of $ε$ requires $T = Ω\big(\frac1ε^2 (\log\frac1ε)^d/2\big)$, and the average cumulative regret is at least $Ω\big( \sqrt{T}(\log T)^d \big)$, thus matching existing upper bounds up to the replacement of $d/2$ by $d+O(1)$ in both cases. For the Matérn-$ν$ kernel, we give analogous bounds of the form $Ω\big( (\frac1ε)^2+d/ν\big)$ and $Ω\big( T^\fracν+ d2ν+ d \big)$, and discuss the resulting gaps to the existing upper bounds.

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Sun, 18 Jun 2017 00:00:00 +0000

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed \emphregret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We show that the phenomenon extends beyond the setting of online linear optimization and present the equivalence for the supervised online learning setting.

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Sun, 18 Jun 2017 00:00:00 +0000

Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives. The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks. As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.

Exact tensor completion with sum-of-squares

Sun, 18 Jun 2017 00:00:00 +0000

We obtain the first polynomial-time algorithm for exact tensor completion that improves over the bound implied by reduction to matrix completion. The algorithm recovers an unknown 3-tensor with $r$ incoherent, orthogonal components in $\mathbb R^n$ from $r⋅\tilde O(n^1.5)$ randomly observed entries of the tensor. This bound improves over the previous best one of $r⋅\tilde O(n^2)$ by reduction to exact matrix completion. Our bound also matches the best known results for the easier problem of approximate tensor completion (Barak & Moitra, 2015). Our algorithm and analysis extends seminal results for exact matrix completion (Candes & Recht, 2009) to the tensor setting via the sum-of-squares method. The main technical challenge is to show that a small number of randomly chosen monomials are enough to construct a degree-3 polynomial with precisely planted orthogonal global optima over the sphere and that this fact can be certified within the sum-of-squares proof system.

Sample complexity of population recovery

Sun, 18 Jun 2017 00:00:00 +0000

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: \beginitemize \item in lossy population recovery, a pollee may skip each question with probability $ε$; \item in noisy population recovery, a pollee may lie on each question with probability $ε$. \enditemize Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $δ$ with high probability. This paper settles the sample complexity of population recovery. For lossy model, the optimal sample complexity is $\tildeΘ(δ^ -2\max{\fracε1-ε,1})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $ε$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(Θ(d^1/3 \log^2/3(1/δ)))$ except for the trivial cases of $ε=0,1/2$ or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam’s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.

Optimal learning via local entropies and sample compression

Sun, 18 Jun 2017 00:00:00 +0000

Under margin assumptions, we prove several risk bounds, represented via the distribution dependent local entropies of the classes or the sizes of specific sample compression schemes. In some cases, our guarantees are optimal up to constant factors for families of classes. We discuss limitations of our approach and give several applications. In particular, we provide a new tight PAC bound for the hard-margin SVM, an extended analysis of certain empirical risk minimizers under log-concave distributions, a new variant of an online to batch conversion, and distribution dependent localized bounds in the aggregation framework. As a part of our results, we give a new upper bound for the uniform deviations under Bernstein assumptions, which may be of independent interest. The proofs for the sample compression schemes are based on the moment method combined with the analysis of voting algorithms.

Fast rates for online learning in Linearly Solvable Markov Decision Processes

Sun, 18 Jun 2017 00:00:00 +0000

We study the problem of online learning in a class of Markov decision processes known as \emphlinearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result is showing that the smoothness of the control cost enables the simple algorithm of \emphfollowing the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^3/4$ for this setting.

Mixing Implies Lower Bounds for Space Bounded Learning

Sun, 18 Jun 2017 00:00:00 +0000

One can learn any hypothesis class H with O(log |H|) labeled examples. Alas, learning with so few examples requires saving the examples in memory, and this requires |X|^(O(log|H|)) memory states, where X is the set of all labeled examples. This motivates the question of how many labeled examples are needed in case the memory is bounded. Previous work showed, using techniques such as linear algebra and Fourier analysis, that parities cannot be learned with bounded memory and less than |H|^(Omega(1)) examples. One might wonder whether a general combinatorial condition exists for unlearnability with bounded memory, as we have with the condition VCdim(H) = Infinity for PAC unlearnability. In this paper we give such a condition. We show that if an hypothesis class H, when viewed as a bipartite graph between hypotheses H and labeled examples X, is mixing, then learning it requires |H|^(Omega(1)) examples under a certain bound on the memory. Note that the class of parities is mixing. Moreover, as an immediate corollary, we get that most hypothesis classes are unlearnable with bounded memory. Our proof technique is combinatorial in nature and very different from previous analyses.

Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality

Sun, 18 Jun 2017 00:00:00 +0000

A number of statistical estimation problems can be addressed by semidefinite programs (SDP). While SDPs are solvable in polynomial time using interior point methods, in practice generic SDP solvers do not scale well to high-dimensional problems. In order to cope with this problem, Burer and Monteiro proposed a non-convex rank-constrained formulation, which has good performance in practice but is still poorly understood theoretically. In this paper we study the rank-constrained version of SDPs arising in MaxCut and in $\mathbb Z_2$ and $\rm SO(d)$ synchronization problems. We establish a Grothendieck-type inequality that proves that all the local maxima and dangerous saddle points are within a small multiplicative gap from the global maximum. We use this structural information to prove that SDPs can be solved within a known accuracy, by applying the Riemannian trust-region method to this non-convex problem, while constraining the rank to be of order one. For the MaxCut problem, our inequality implies that any local maximizer of the rank-constrained SDP provides a $(1 - 1/(k-1)) \times 0.878$ approximation of the MaxCut, when the rank is fixed to $k$. We then apply our results to data matrices generated according to the Gaussian $\mathbb Z_2$ synchronization problem, and the two-groups stochastic block model with large bounded degree. We prove that the error achieved by local maximizers undergoes a phase transition at the same threshold as for information-theoretically optimal methods.

A Second-order Look at Stability and Generalization

Sun, 18 Jun 2017 00:00:00 +0000

Using differentiability assumptions on the loss function and a concentration inequality for bounded second order differences it is shown that the generalization error for classification with L2 regularisation obeys a Bernstein-type inequality.

Inapproximability of VC Dimension and Littlestone’s Dimension

Sun, 18 Jun 2017 00:00:00 +0000

We study the complexity of computing the VC Dimension and Littlestone’s Dimension. Given an explicit description of a finite universe and a concept class (a binary matrix whose $(x,C)$-th entry is $1$ iff element $x$ belongs to concept $C$), both can be computed exactly in quasi-polynomial time ($n^O(\log n)$). Assuming the randomized Exponential Time Hypothesis (ETH), we prove nearly matching lower bounds on the running time, that hold even for \em approximation algorithms.

Noisy Population Recovery from Unknown Noise

Sun, 18 Jun 2017 00:00:00 +0000

The noisy population recovery problem is a statistical inference problem, which is a special case of the problem of learning mixtures of product distributions. Given an unknown distribution on $n$-bit strings with support of size $k$, and given access only to noisy samples from it, where each bit is flipped independently with some unknown noise probability, estimate from a few samples the underlying parameters of the model. Previous work [De et al., FOCS 2016] designed polynomial time algorithms which work under the assumption that the noise parameters are known exactly. In this work, we remove this assumption, and show how to recover the underlying parameters, even when the noise is unknown, in quasi-polynomial time.

Adaptivity to Noise Parameters in Nonparametric Active Learning

Sun, 18 Jun 2017 00:00:00 +0000

This work addresses various open questions in the theory of active learning for nonparametric classification. Our contributions are both statistical and algorithmic: \beginitemize \item We establish new minimax-rates for active learning under common noise conditions. These rates display interesting transitions – due to the interaction between noise smoothness and margin – not present in the passive setting. Some such transitions were previously conjectured, but remained unconfirmed. \item We present a generic algorithmic strategy for adaptivity to unknown noise smoothness and margin; our strategy achieves optimal rates in many general situations; furthermore, unlike in previous work, we avoid the need for adaptive confidence sets, resulting in strictly milder distributional requirements. \enditemize

Robust and Proper Learning for Mixtures of Gaussians via Systems of Polynomial Inequalities

Sun, 18 Jun 2017 00:00:00 +0000

Learning a Gaussian mixture model (GMM) is a fundamental statistical problem. One common notion of learning a GMM is proper learning: here, the goal is to find a mixture of $k$ Gaussians $\mathcal{M}$ that is close to the unknown density $f$ from which we draw samples. The distance between $\mathcal{M}$ and $f$ is often measured in the total variation / $L_1$-distance. Our main result is an algorithm for learning a mixture of $k$ univariate Gaussians that is \emphnearly-optimal for any fixed $k$. It is well known that the sample complexity of properly learning a univariate $k$-GMM is $O(k / ε^2)$. However, the best prior \emphrunning time for this problem is $\widetilde{O}(1 / ε^3k-1)$; in particular, the dependence between $1/ε$ and $k$ is exponential. In this paper, we significantly improve this dependence by replacing the $1/ε$ term with $\log 1/ε$, while only increasing the exponent moderately. Specifically, the running time of our algorithm is $(k ⋅\log 1/ ε)^O(k^4) + \widetilde{O}(k / ε^2)$. For any fixed $k$, the $\widetilde{O}(k / ε^2)$ term dominates our running time, and thus our algorithm runs in time which is \emphnearly-linear in the number of samples drawn. Achieving a running time of $\text{poly}(k, 1 / ε)$ for proper learning of $k$-GMMs has recently been stated as an open problem by multiple researchers, and we make progress on this question. Our main algorithmic ingredient is a new connection between proper learning of parametric distributions and systems of polynomial inequalities. We leverage results for piecewise polynomial approximation of GMMs and reduce the learning problem to a much smaller sub-problem. While tihs sub-problem is still non-convex, its size depends only logarithmically on the final accuracy $ε$. Hence we can invoke computationally expensive methods for solving the sub-problem. We show that our connection is also useful in the multivariate setting, where we get new results for learning a mixture of two spherical Gaussians. A variant of our approach is also within reach of modern computer algebra systems. Experiments for learning a 2-GMM show promising results: our algorithm improves over the popular Expectation-Maximization (EM) algorithm in the noisy setting.

Fundamental limits of symmetric low-rank matrix estimation

Sun, 18 Jun 2017 00:00:00 +0000

We consider the high-dimensional inference problem where the signal is a low-rank symmetric matrix which is corrupted by an additive Gaussian noise. Given a probabilistic model for the low-rank matrix, we compute the limit in the large dimension setting for the mutual information between the signal and the observations, as well as the matrix minimum mean square error, while the rank of the signal remains constant. We unify and generalize a number of recent works on PCA, sparse PCA, submatrix localization or community detection by computing the information-theoretic limits for these problems in the high noise regime. This allows to locate precisely the information-theoretic thresholds for the above mentioned problems.

On the Ability of Neural Nets to Express Distributions

Sun, 18 Jun 2017 00:00:00 +0000

Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution—also theoretically not understood—concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks. We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with $n$ hidden layers. A key ingredient is Barron’s Theorem (Barron, 1993), which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of $n$ functions which satisfy certain Fourier conditions (“Barron functions”) can be approximated by a $n+1$-layer neural network. For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance—a natural metric on probability distributions—by a neural network applied to a fixed base distribution (e.g., multivariate gaussian). Building up recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone.

Sparse Stochastic Bandits

Sun, 18 Jun 2017 00:00:00 +0000

In the classical multi-armed bandit problem, $d$ arms are available to the decision maker who pulls them sequentially in order to maximize his cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with $d$ (or with $\sqrt{d}$ in the minimax sense). We here consider the \emphsparse case of this classical problem in the sense that only a small number of arms, namely $s

Bandits with Movement Costs and Adaptive Pricing

Sun, 18 Jun 2017 00:00:00 +0000

We extend the model of Multi-Armed Bandit with unit switching cost to incorporate a metric between the actions. We consider the case where the metric over the actions can be modeled by a complete binary tree, and the distance between two leaves is the size of the subtree of their least common ancestor, which abstracts the case that the actions are points on the continuous interval $[0,1]$ and the switching cost is their distance. In this setting, we give a new algorithm that establishes a regret of $\widetilde{O}(\sqrt{k}T + T/k)$, where $k$ is the number of actions and $T$ is the time horizon. When the set of actions corresponds to whole $[0,1]$ interval we can exploit our method for the task of bandit learning with Lipschitz loss functions, where our algorithm achieves an optimal regret rate of $\widetilde{Θ}(T^2/3)$, which is the same rate one obtains when there is no penalty for movements. As our main application, we use our new algorithm to solve an adaptive pricing problem. Specifically, we consider the case of a single seller faced with a stream of patient buyers. Each buyer has a private value and a window of time in which they are interested in buying, and they buy at the lowest price in the window, if it is below their value. We show that with an appropriate discretization of the prices, the seller can achieve a regret of $\widetilde{O}(T^2/3)$ compared to the best fixed price in hindsight, which outperform the previous regret bound of $\widetilde{O}(T^3/4)$ for the problem.

Predicting with Distributions

Sun, 18 Jun 2017 00:00:00 +0000

We consider a new learning model in which a joint distribution over vector pairs $(x,y)$ is determined by an unknown function $c(x)$ that maps input vectors $x$ not to individual outputs, but to entire \em distributions\/ over output vectors $y$. Our main results take the form of rather general reductions from our model to algorithms for PAC learning the function class and the distribution class separately, and show that virtually every such combination yields an efficient algorithm in our model. Our methods include a randomized reduction to classification noise and an application of Le Cam’s method to obtain robust learning algorithms.

The Hidden Hubs Problem

Sun, 18 Jun 2017 00:00:00 +0000

We introduce the following \em hidden hubs model $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify the hubs efficiently. The planted Gaussian Submatrix Model is the special case where the higher variance entries must all lie in a $k \times k$ submatrix. If $k≥c\sqrt{n}\ln n$, just the row sums are sufficient to find $S$ in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a $\sqrt\ln n$ factor to $k \ge c\sqrt{n}$ by spectral or combinatorial methods. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^0.5-δ$ for some $δ>0$, when $\sigma_1^2>2\sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances, each at least $\sigma_1^2$. We also show a nearly matching lower bound: for $\sigma_1^2 \le 2\sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,\sigma_0^2)$ and a matrix with $k=n^0.5-δ$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $\sigma_1^2=2\sigma_0^2$, we show that the hidden hubs problem can be solved for $k≥c\sqrt n(\ln n)^1/4$, improving on the naive row sum-based method.

Preface: Conference on Learning Theory (COLT), 2017

Sun, 18 Jun 2017 00:00:00 +0000

A Unified Analysis of Stochastic Optimization Methods Using Jump System Theory and Quadratic Constraints

Sun, 18 Jun 2017 00:00:00 +0000

We develop a simple routine unifying the analysis of several important recently-developed stochastic optimization methods including SAGA, Finito, and stochastic dual coordinate ascent (SDCA). First, we show an intrinsic connection between stochastic optimization methods and dynamic jump systems, and propose a general jump system model for stochastic optimization methods. Our proposed model recovers SAGA, SDCA, Finito, and SAG as special cases. Then we combine jump system theory with several simple quadratic inequalities to derive sufficient conditions for convergence rate certifications of the proposed jump system model under various assumptions (with or without individual convexity, etc). The derived conditions are linear matrix inequalities (LMIs) whose size roughly scale with the size of the training set. We make use of the symmetry in the stochastic optimization methods and reduce these LMIs to some equivalent small LMIs whose sizes are at most 3 by 3. We solve these small LMIs to provide analytical proofs of new convergence rates for SAGA, Finito and SDCA (with or without individual convexity). We also explain why our proposed LMI fails in analyzing SAG. We reveal a key difference between SAG and other methods, and briefly discuss how to extend our LMI analysis for SAG. An advantage of our approach is that the proposed analysis can be automated for a large class of stochastic methods under various assumptions (with or without individual convexity, etc).

Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes

Sun, 18 Jun 2017 00:00:00 +0000

In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. The RTD of a concept class $\mathcal C⊆{0,1}^n$ , introduced by Zilles et al. (2011), is a combinatorial complexity measure characterized by the worst-case number of examples necessary to identify a concept in $\mathcal C$ according to the recursive teaching model. For any finite concept class $\mathcal C⊆{0,1}^n$ with $\mathrm{VCD}(\mathcal C) = d$, Simon and Zilles (2015) posed an open problem $\mathrm{RTD}(\mathcal C) = O(d)$, i.e., is RTD linearly upper bounded by VCD? Previously, the best known result is an exponential upper bound $\mathrm{RTD}(\mathcal C) = O(d\cdot2^d)$, due to Chen et al. (2016). In this paper, we show a quadratic upper bound: $\mathrm{RTD}(\mathcal C) = O(d^2)$, much closer to an answer to the open problem. We also discuss the challenges in fully solving the problem.

Surprising properties of dropout in deep networks

Sun, 18 Jun 2017 00:00:00 +0000

We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various re-scalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress.

Submodular Optimization under Noise

Sun, 18 Jun 2017 00:00:00 +0000

We consider the problem of maximizing a monotone submodular function under noise. Since the 1970s there has been a great deal of work on optimization of submodular functions under various constraints, resulting in algorithms that provide desirable approximation guarantees. In many applications, however, we do not have access to the submodular function we aim to optimize, but rather to some erroneous or noisy version of it. This raises the question of whether provable guarantees are obtainable in the presence of error and noise. We provide initial answers by focusing on the problem of maximizing a monotone submodular function under a cardinality constraint when given access to a noisy oracle of the function. We show that there is an algorithm whose approximation ratio is arbitrarily close to the optimal $1-1/e$ when the cardinality is sufficiently large. The algorithm can be applied in a variety of related problems including maximizing approximately submodular functions, and optimization with correlated noise. When the noise is adversarial we show that no non-trivial approximation guarantee can be obtained.

Nearly-tight VC-dimension bounds for piecewise linear neural networks

Sun, 18 Jun 2017 00:00:00 +0000

We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$, and provide examples with VC-dimension $Ω( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $Θ(W U)$ on the VC-dimension. All of these results generalize to arbitrary piecewise linear activation functions.

Fast Rates for Empirical Risk Minimization of Strict Saddle Problems

Sun, 18 Jun 2017 00:00:00 +0000

We derive bounds on the sample complexity of empirical risk minimization (ERM) in the context of minimizing non-convex risks that admit the strict saddle property. Recent progress in non-convex optimization has yielded efficient algorithms for minimizing such functions. Our results imply that these efficient algorithms are statistically stable and also generalize well. In particular, we derive fast rates which resemble the bounds that are often attained in the strongly convex setting. We specify our bounds to Principal Component Analysis and Independent Component Analysis. Our results and techniques may pave the way for statistical analyses of additional strict saddle problems.

Reliably Learning the ReLU in Polynomial Time

Sun, 18 Jun 2017 00:00:00 +0000

We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \mathsf{max}(0, \mathbf{w} ⋅\mathbf{x})$ with $\mathbf{w} ∈\mathbb{S}^n-1$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade and Mansour (2012) where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial-time (in $n$) with respect to \em any distribution on $\mathbb{S}^n-1$ (the unit sphere in $n$ dimensions) and for any error parameter $ε= Ω(1 / \log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $ε$ must be $Ω(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLU with fixed polynomial-dependence in the dimension. For depth-2 networks of sigmoids, we obtain the first algorithms that have a polynomial dependency in \em all parameters. Our techniques combine kernel methods and polynomial approximations with a “dual-loss” approach to convex programming. As a byproduct we obtain a number of applications including the first set of efficient algorithms for “convex piecewise-linear fitting” and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.

Effective Semisupervised Learning on Manifolds

Sun, 18 Jun 2017 00:00:00 +0000

The abundance of unlabeled data makes semi-supervised learning (SSL) an attractive approach for improving the accuracy of learning systems. However, we are still far from a complete theoretical understanding of the benefits of this learning scenario in terms of sample complexity. In particular, for many natural learning settings it can in fact be shown that SSL does not improve sample complexity. Thus far, the only case where SSL provably helps, without compatibility assumptions, is a recent combinatorial construction of Darnstadt et al. Deriving similar theoretical guarantees for more commonly used approaches to SSL remains a challenge. Here, we provide the first analysis of manifold based SSL, where there is a provable gap between supervised learning and SSL, and this gap can be arbitrarily large. Proving the required lower bound is a technical challenge, involving tools from geometric measure theory. The algorithm we analyse is similar to subspace clustering, and thus our results demonstrate that this method can be used to improve sample complexity.

Two-Sample Tests for Large Random Graphs Using Network Statistics

Sun, 18 Jun 2017 00:00:00 +0000

We consider a two-sample hypothesis testing problem, where the distributions are defined on the space of undirected graphs, and one has access to only one observation from each model. A motivating example for this problem is comparing the friendship networks on Facebook and LinkedIn. The practical approach to such problems is to compare the networks based on certain network statistics. In this paper, we present a general principle for two-sample hypothesis testing in such scenarios without making any assumption about the network generation process. The main contribution of the paper is a general formulation of the problem based on concentration of network statistics, and consequently, a consistent two-sample test that arises as the natural solution for this problem. We also show that the proposed test is minimax optimal for certain network statistics.

Matrix Completion from $O(n)$ Samples in Linear Time

Sun, 18 Jun 2017 00:00:00 +0000

We consider the problem of reconstructing a rank-$k$ $n \times n$ matrix $M$ from a sampling of its entries. Under a certain incoherence assumption on $M$ and for the case when both the rank and the condition number of $M$ are bounded, it was shown in (Candès and Recht, 2009; Candès and Tao, 2010; Keshavan et al., 2010; Recht, 2011; Jain et al., 2012; Hardt, 2014) that $M$ can be recovered exactly or approximately (depending on some trade-off between accuracy and computational complexity) using $O(n \text{poly}(\log n))$ samples in super-linear time $O(n^a \text{poly}(\log n))$ for some constant $a ≥1$. In this paper, we propose a new matrix completion algorithm using a novel sampling scheme based on a union of independent sparse random regular bipartite graphs. We show that under the same conditions w.h.p. our algorithm recovers an $ε$-approximation of $M$ in terms of the Frobenius norm using $O(n \log^2(1/ε))$ samples and in linear time $O(n \log^2(1/ε))$. This provides the best known bounds both on the sample complexity and computational cost for reconstructing (approximately) an unknown low-rank matrix. The novelty of our algorithm is two new steps of thresholding singular values and rescaling singular vectors in the application of the “vanilla” alternating minimization algorithm. The structure of sparse random regular graphs is used heavily for controlling the impact of these regularization steps.

Memoryless Sequences for Differentiable Losses

Sun, 18 Jun 2017 00:00:00 +0000

One way to define the “randomness” of a fixed individual sequence is to ask how hard it is to predict. When prediction error is measured via squared loss, it has been established that memoryless sequences (which are, in a precise sense, hard to predict) have some of the stochastic attributes of truly random sequences. In this paper, we ask how changing the loss function used changes the set of memoryless sequences, and in particular, the stochastic attributes they possess. We answer this question for differentiable convex loss functions using tools from property elicitation, showing that the property elicited by the loss determines the stochastic attributes of the corresponding memoryless sequences. We apply our results to price calibration in prediction markets.

ZigZag: A New Approach to Adaptive Online Learning

Sun, 18 Jun 2017 00:00:00 +0000

We develop a new family of algorithms for the online learning setting with regret against any data sequence bounded by the empirical Rademacher complexity of that sequence. To develop a general theory of when this type of adaptive regret bound is achievable we establish a connection to the theory of decoupling inequalities for martingales in Banach spaces. When the hypothesis class is a set of linear functions bounded in some norm, such a regret bound is achievable if and only if the norm satisfies certain decoupling inequalities for martingales. Donald Burkholder’s celebrated geometric characterization of decoupling inequalities (1984) states that such an inequality holds if and only if there exists a special function called a Burkholder function satisfying certain restricted concavity properties. Our online learning algorithms are efficient in terms of queries to this function. We realize our general theory by giving new efficient and adaptive algorithms for classes including $\ell_p$ norms, group norms, and reproducing kernel Hilbert spaces. The empirical Rademacher complexity regret bound implies — when used in the i.i.d. setting — a data-dependent complexity bound for excess risk after online-to-batch conversion. To showcase the power of the empirical Rademacher complexity regret bound, we derive improved rates for a supervised learning generalization of the online learning with low rank experts task and for the online matrix prediction task. In addition to obtaining tight data-dependent regret bounds, our algorithms enjoy improved efficiency over previous techniques based on Rademacher complexity, automatically work in the infinite horizon setting, and adapt to scale. To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization. We conclude with a number of open problems and new directions, both algorithmic and information-theoretic.

Stochastic Composite Least-Squares Regression with Convergence Rate $O(1/n)$

Sun, 18 Jun 2017 00:00:00 +0000

We consider the minimization of composite objective functions composed of the expectation of quadratic functions and an arbitrary convex function. We study the stochastic dual averaging algorithm with a constant step-size, showing that it leads to a convergence rate of O(1/n) without strong convexity assumptions. This thus extends earlier results on least-squares regression with the Euclidean geometry to (a) all convex regularizers and constraints, and (b) all geometries represented by a Bregman divergence. This is achieved by a new proof technique that relates stochastic and deterministic recursions

Open Problem: Meeting Times for Learning Random Automata

Sun, 18 Jun 2017 00:00:00 +0000

Learning automata is a foundational problem in computational learning theory. However, even efficiently learning random DFAs is hard. A natural restriction of this problem is to consider learning random DFAs under the uniform distribution. To date, this problem has no non-trivial lower bounds nor algorithms faster than brute force. In this note, we propose a method to find faster algorithms for this problem. We reduce the learning problem to a conjecture about meeting times of random walks over random DFAs, which may be of independent interest to prove.

A General Characterization of the Statistical Query Complexity

Sun, 18 Jun 2017 00:00:00 +0000

Statistical query (SQ) algorithms are algorithms that have access to an \em SQ oracle for the input distribution $D$ instead of i.i.d. samples from $D$. Given a query function $φ:X \to [-1,1]$, the oracle returns an estimate of $\bf E_x∼D[φ(x)]$ within some tolerance $\tau_φ$ that roughly corresponds to the number of samples. In this work we demonstrate that the complexity of solving an arbitrary statistical problem using SQ algorithms can be captured by a relatively simple notion of statistical dimension that we introduce. SQ algorithms capture a broad spectrum of algorithmic approaches used in theory and practice, most notably, convex optimization techniques. Hence our statistical dimension allows to investigate the power of a variety of algorithmic approaches by analyzing a single linear-algebraic parameter. Such characterizations were investigated over the past 20 years in learning theory but prior characterizations are restricted to the much simpler setting of classification problems relative to a fixed distribution on the domain. Our characterization is also the first to precisely characterize the necessary tolerance of queries. We give applications of our techniques to two open problems in learning theory and to algorithms that are subject to memory and communication constraints.

Greed Is Good: Near-Optimal Submodular Maximization via Greedy Optimization

Sun, 18 Jun 2017 00:00:00 +0000

It is known that greedy methods perform well for maximizing \textitmonotone submodular functions. At the same time, such methods perform poorly in the face of non-monotonicity. In this paper, we show—arguably, surprisingly—that invoking the classical greedy algorithm $O(\sqrt{k})$-times leads to the (currently) fastest deterministic algorithm, called RepeatedGreedy, for maximizing a general submodular function subject to $k$-independent system constraints. RepeatedGreedy achieves $(1 + O(1/\sqrt{k}))k$ approximation using $O(nr\sqrt{k})$ function evaluations (here, $n$ and $r$ denote the size of the ground set and the maximum size of a feasible solution, respectively). We then show that by a careful sampling procedure, we can run the greedy algorithm only \textitonce and obtain the (currently) fastest randomized algorithm, called SampleGreedy, for maximizing a submodular function subject to $k$-extendible system constraints (a subclass of $k$-independent system constrains). SampleGreedy achieves $(k + 3)$-approximation with only $O(nr/k)$ function evaluations. Finally, we derive an almost matching lower bound, and show that no polynomial time algorithm can have an approximation ratio smaller than $ k + 1/2 - \varepsilon$. To further support our theoretical results, we compare the performance of RepeatedGreedy and SampleGreedy with prior art in a concrete application (movie recommendation). We consistently observe that while SampleGreedy achieves practically the same utility as the best baseline, it performs at least two orders of magnitude faster.

Generalization for Adaptively-chosen Estimators via Stable Median

Sun, 18 Jun 2017 00:00:00 +0000

Datasets are often reused to perform multiple statistical analyses in an adaptive way, in which each analysis may depend on the outcomes of previous analyses on the same dataset. Standard statistical guarantees do not account for these dependencies and little is known about how to provably avoid overfitting and false discovery in the adaptive setting. We consider a natural formalization of this problem in which the goal is to design an algorithm that, given a limited number of i.i.d. samples from an unknown distribution, can answer adaptively-chosen queries about that distribution. We present an algorithm that estimates the expectations of $k$ arbitrary adaptively-chosen real-valued estimators using a number of samples that scales as $\sqrt{k}$. The answers given by our algorithm are essentially as accurate as if fresh samples were used to evaluate each estimator. In contrast, prior work yields error guarantees that scale with the worst-case sensitivity of each estimator. We also give a version of our algorithm that can be used to verify answers to such queries where the sample complexity depends logarithmically on the number of queries $k$ (as in the reusable holdout technique). Our algorithm is based on a simple approximate median algorithm that satisfies the strong stability guarantees of differential privacy. Our techniques provide a new approach for analyzing the generalization guarantees of differentially private algorithms.

Learning Multivariate Log-concave Distributions

Sun, 18 Jun 2017 00:00:00 +0000

We study the problem of estimating multivariate log-concave probability density functions. We prove the first sample complexity upper bound for learning log-concave densities on $\mathbb{R}^d$, for all $d ≥1$. Prior to our work, no upper bound on the sample complexity of this learning problem was known for the case of $d>3$. In more detail, we give an estimator that, for any $d \ge 1$ and $ε>0$, draws $\tilde{O}_d \left( (1/ε)^(d+5)/2 \right)$ samples from an unknown target log-concave density on $R^d$, and outputs a hypothesis that (with high probability) is $ε$-close to the target, in total variation distance. Our upper bound on the sample complexity comes close to the known lower bound of $\Omega_d \left( (1/ε)^(d+1)/2 \right)$ for this problem.

High Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transtition

Sun, 18 Jun 2017 00:00:00 +0000

We consider a sparse linear regression model $Y=Xβ^*+W$ where $X$ is $n\times p$ matrix Gaussian i.i.d. entries, $W$ is $n\times 1$ noise vector with i.i.d. mean zero Gaussian entries and standard deviation $σ$, and $β^*$ is $p\times 1$ binary vector with support size (sparsity) $k$. Using a novel conditional second moment method we obtain a tight up to a multiplicative constant approximation of the optimal squared error $\min_β\|Y-Xβ\|_2$, where the minimization is over all $k$-sparse binary vectors $β$. The approximation reveals interesting structural properties of the underlying regression problem. In particular, \beginenumerate \item [(a)] We establish that $n^*=2k\log p/\log (2k/σ^2+1)$ is a phase transition point with the following “all-or-nothing” property. When $n$ exceeds $n^*$, $(2k)^-1\|\beta_2-β^*\|_0≈0$, and when $n$ is below $n^*$, $(2k)^-1\|\beta_2-β^*\|_0≈1$, where $\beta_2$ is the optimal solution achieving the smallest squared error. As a corollary $n^*$ is the asymptotic threshold for recovering $β^*$ information theoretically. Note that $n^*$ is asymptotically below the threshold $n_\text{LASSO}/CS=(2k+σ^2)\log p$, above which the LASSO and Compressive Sensing methods are able to recover $β^*$. \item [(b)] We compute the squared error for an intermediate problem $\min_β\|Y-Xβ\|_2$ where the minimization is restricted to vectors $β$ with $\|β-β^*\|_0=2k ζ$, for some fixed ratio $ζ∈[0,1]$. We show that a lower bound part $Γ(ζ)$ of the estimate, which essentially corresponds to the estimate based on the first moment method, undergoes a phase transition at three different thresholds, namely $n_\text{inf},1=σ^2\log p$, which is information theoretic bound for recovering $β^*$ when $k=1$ and $σ$ is large, then at $n^*$ and finally at $n_\text{LASSO}/CS$. \item [(c)] We establish a certain Overlap Gap Property (OGP) on the space of all $k$-sparse binary vectors $β$ when $n\le ck\log p$ for sufficiently small constant $c$. By drawing a connection with a similar OGP exhibited by many randomly generated constraint satisfaction problems and statistical physics models, we conjecture that OGP is the source of algorithmic hardness of solving the minimization problem $\min_β\|Y-Xβ\|_2$ in the regime $n

Ten Steps of EM Suffice for Mixtures of Two Gaussians

Sun, 18 Jun 2017 00:00:00 +0000

The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than $1%$ error estimation of the means. In the finite sample regime, we show that, under a random initialization, $\tilde{O}(d/ε^2)$ samples suffice to compute the unknown vectors to within $ε$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM based estimator is $\tilde{O}\left(\sqrt{d} \over n\right)$ where $n$ is the number of samples, which is optimal up to logarithmic factors.

Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing

Sun, 18 Jun 2017 00:00:00 +0000

We show that the square Hellinger distance between two Bayesian networks on the same directed graph, $G$, is subadditive with respect to the neighborhoods of $G$. Namely, if $P$ and $Q$ are the probability distributions defined by two Bayesian networks on the same DAG, our inequality states that the square Hellinger distance, $H^2(P,Q)$, between $P$ and $Q$ is upper bounded by the sum, $\sum_v H^2(P_{v} ∪\Pi_v, Q_{v} ∪\Pi_v)$, of the square Hellinger distances between the marginals of $P$ and $Q$ on every node $v$ and its parents $\Pi_v$ in the DAG. Importantly, our bound does not involve the conditionals but the marginals of $P$ and $Q$. We derive a similar inequality for more general Markov Random Fields. As an application of our inequality, we show that distinguishing whether two (unknown) Bayesian networks $P$ and $Q$ on the same (but potentially unknown) DAG satisfy $P=Q$ vs $d_\rm TV(P,Q)>ε$ can be performed from $\tilde{O}(|Σ|^3/4(d+1) ⋅n/ε^2)$ samples, where $d$ is the maximum in-degree of the DAG and $Σ$ the domain of each variable of the Bayesian networks. If $P$ and $Q$ are defined on potentially different and potentially unknown trees, the sample complexity becomes $\tilde{O}(|Σ|^4.5 n/ε^2)$. In both cases the dependence of the sample complexity on $n, ε$ is optimal up to logarithmic factors. Lastly, if $P$ and $Q$ are product distributions over ${0,1}^n$ and $Q$ is known, the sample complexity becomes $O(\sqrt{n}/ε^2)$, which is optimal up to constant factors.

Depth Separation for Neural Networks

Sun, 18 Jun 2017 00:00:00 +0000

Let $f:\mathbb{S}^d-1\times \mathbb{S}^d-1\to\mathbb{S}$ be a function of the form $f(x,x’) = g(⟨x,x’⟩)$ for $g:[-1,1]\to \mathbb{R}$. We give a simple proof that shows that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. Moreover, for many $g$’s, such as $g(x)=\sin(\pi d^3x)$, the number of neurons must be $2^Ω\left(d\log(d)\right)$. Furthermore, the result holds w.r.t. the uniform distribution on $\mathbb{S}^d-1\times \mathbb{S}^d-1$. As many functions of the above form can be well approximated by poly-size depth three networks with poly-bounded weights, this establishes a separation between depth two and depth three networks w.r.t. the uniform distribution on $\mathbb{S}^d-1\times \mathbb{S}^d-1$.

Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent

Sun, 18 Jun 2017 00:00:00 +0000

In this paper, we revisit the recently established theoretical guarantees for the convergence of the Langevin Monte Carlo algorithm of sampling from a smooth and (strongly) log-concave density. We improve the existing results when the convergence is measured in the Wasserstein distance and provide further insights on the very tight relations between, on the one hand, the Langevin Monte Carlo for sampling and, on the other hand, the gradient descent for optimization. Finally, we also establish guarantees for the convergence of a version of the Langevin Monte Carlo algorithm that is based on noisy evaluations of the gradient.

Online Learning Without Prior Information

Sun, 18 Jun 2017 00:00:00 +0000

The vast majority of optimization and online learning algorithms today require some prior information about the data (often in the form of bounds on gradients or on the optimal parameter value). When this information is not available, these algorithms require laborious manual tuning of various hyperparameters, motivating the search for algorithms that can adapt to the data with no prior information. We describe a frontier of new lower bounds on the performance of such algorithms, reflecting a tradeoff between a term that depends on the optimal parameter value and a term that depends on the gradients’ rate of growth. Further, we construct a family of algorithms whose performance matches any desired point on this frontier, which no previous algorithm reaches.

Tight Bounds for Bandit Combinatorial Optimization

Sun, 18 Jun 2017 00:00:00 +0000

We revisit the study of optimal regret rates in bandit combinatorial optimization—a fundamental framework for sequential decision making under uncertainty that abstracts numerous combinatorial prediction problems. We prove that the attainable regret in this setting grows as $\widetildeΘ(k^3/2\sqrt{d}T)$ where $d$ is the dimension of the problem and $k$ is a bound over the maximal instantaneous loss, disproving a conjecture of Audibert, Bubeck, and Lugosi (2013) who argued that the optimal rate should be of the form $\widetildeΘ(k\sqrt{d}T)$. Our bounds apply to several important instances of the framework, and in particular, imply a tight bound for the well-studied bandit shortest path problem. By that, we also resolve an open problem posed by Cesa-Bianchi and Lugosi (2012).

Thresholding Based Outlier Robust PCA

Sun, 18 Jun 2017 00:00:00 +0000

We consider the problem of outlier robust PCA (\textbfOR-PCA) where the goal is to recover principal directions despite the presence of outlier data points. That is, given a data matrix $M^*$, where $(1-α)$ fraction of the points are noisy samples from a low-dimensional subspace while $α$ fraction of the points can be arbitrary outliers, the goal is to recover the subspace accurately. Existing results for \textbfOR-PCA have serious drawbacks: while some results are quite weak in the presence of noise, other results have runtime quadratic in dimension, rendering them impractical for large scale applications. In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. Moreover, the fraction of outliers, $α$, that our method can handle is tight up to constants while providing nearly optimal computational complexity for a general noise setting. For the special case where the inliers are obtained from a low-dimensional subspace with additive Gaussian noise, we show that a modification of our thresholding based method leads to significant improvement in recovery error (of the subspace) even in the presence of a large fraction of outliers.

Towards Instance Optimal Bounds for Best Arm Identification

Sun, 18 Jun 2017 00:00:00 +0000

In the classical best arm identification (Best-$1$-Arm) problem, we are given $n$ stochastic bandit arms, each associated with a reward distribution with an unknown mean. Upon each play of an arm, we can get a reward sampled i.i.d. from its reward distribution. We would like to identify the arm with the largest mean with probability at least $1-δ$, using as few samples as possible. The problem has a long history and understanding its sample complexity has attracted significant attention since the last decade. However, the optimal sample complexity of the problem is still unknown. Recently, Chen and Li (2016) made an interesting conjecture, called gap-entropy conjecture, concerning the instance optimal sample complexity of Best-$1$-Arm. Given a Best-$1$-Arm instance $I$ (i.e., a set of arms), let $\mu_[i]$ denote the $i$th largest mean and $\Delta_[i]=\mu_[1]-\mu_[i]$ denote the corresponding gap. $H(I)=\sum_i=2^n\Delta_[i]^-2$ denotes the complexity of the instance. The gap-entropy conjecture states that for any instance $I$, $Ω\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)\right)$ is an instance lower bound, where $\mathsf{Ent}(I)$ is an entropy-like term determined by the gaps, and there is a $δ$-correct algorithm for Best-$1$-Arm with sample complexity $O\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)+\Delta_[2]^-2\ln\ln\Delta_[2]^-1\right)$. We note that $Θ\left(\Delta_[2]^-2\ln\ln\Delta_[2]^-1\right)$ is necessary and sufficient to solve the two-arm instance with the best and second best arms. If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of Best-$1$-Arm (up to constant factors). In this paper, we make significant progress towards a complete resolution of the gap-entropy conjecture. For the upper bound, we provide a highly nontrivial algorithm which requires \[O\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)+\Delta_[2]^-2\ln\ln\Delta_[2]^-1\mathrmpolylog(n,δ^-1)\right)\]samples in expectation for any instance $I$. For the lower bound, we show that for any Gaussian Best-$1$-Arm instance with gaps of the form $2^-k$, any $δ$-correct monotone algorithm requires at least \[Ω\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)\right)\]samples in expectation. Here, a monotone algorithm is one which uses no more samples (in expectation) on $I’$ than on $I$, if $I’$ is a sub-instance of $I$ obtained by removing some sub-optimal arms.

Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration

Sun, 18 Jun 2017 00:00:00 +0000

We study the combinatorial pure exploration problem \textscBest-Set in a stochastic multi-armed bandit game. In an \textscBest-Set instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a family $\mathcal{F}$ of feasible subsets over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify the feasible subset in $\mathcal{F}$ with the maximum total weight, using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel \textitinstance-wise lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\mathcal{F}|$. For an important class of combinatorial families (including spanning trees, matchings, and path constraints), we also provide polynomial time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex program, and the notion of approximate Pareto curves in multi-objective optimization (note that $|\mathcal{F}|$ can be exponential in $n$). We also show that the $\ln|\mathcal{F}|$ factor is inevitable in general, through a nontrivial lower bound construction utilizing a combinatorial structure resembling the Nisan-Wigderson design. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general \textscBest-Set problem. We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\mathcal{O}$ of disjoint subsets. Our goal is to determine the subset in $\mathcal{O}$ that contains the mean profile (which is the $n$-dimensional vector of the means), using as few samples as possible. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Sun, 18 Jun 2017 00:00:00 +0000

We investigate contextual online learning with nonparametric (Lipschitz) comparison classes under different assumptions on losses and feedback information. For full information feedback and Lipschitz losses, we design the first explicit algorithm achieving the minimax regret rate (up to log factors). In a partial feedback model motivated by second-price auctions, we obtain algorithms for Lipschitz and semi-Lipschitz losses with regret bounds improving on the known bounds for standard bandit feedback. Our analysis combines novel results for contextual second-price auctions with a novel algorithmic approach based on chaining. When the context space is Euclidean, our chaining approach is efficient and delivers an even better regret bound.

Multi-Observation Elicitation

Sun, 18 Jun 2017 00:00:00 +0000

We study loss functions that measure the accuracy of a prediction based on multiple data points simultaneously. To our knowledge, such loss functions have not been studied before in the area of property elicitation or in machine learning more broadly. As compared to traditional loss functions that take only a single data point, these multi-observation loss functions can in some cases drastically reduce the dimensionality of the hypothesis required. In elicitation, this corresponds to requiring many fewer reports; in empirical risk minimization, it corresponds to algorithms on a hypothesis space of much smaller dimension. We explore some examples of the tradeoff between dimensionality and number of observations, give some geometric characterizations and intuition for relating loss functions and the properties that they elicit, and discuss some implications for both elicitation and machine-learning contexts.

Testing Bayesian Networks

Sun, 18 Jun 2017 00:00:00 +0000

This work initiates a systematic investigation of testing \em high-dimensional structured distributions by focusing on testing \em Bayesian networks – the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other non-descendant nodes once its parents are fixed. Specifically, we study the properties of identity testing and closeness testing of Bayesian networks. Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. For a wide range of parameter settings, our testing algorithms have sample complexity \em sublinear in the dimension and are sample-optimal, up to constant factors.

Learning Disjunctions of Predicates

Sun, 18 Jun 2017 00:00:00 +0000

Let $\mathcal F$ be a set of boolean functions. We give an algorithm for learning $\mathcal F_∨:={\vee_f∈Sf | S⊆\mathcal {F}}$ from membership queries. Our algorithm asks at most $|\mathcal {F}|⋅\rm OPT(\mathcal {F}_∨)$ membership queries where $\rm OPT(\mathcal{F}_∨)$ is the minimum worst case number of membership queries for learning $\mathcal{F}_∨$. When $\mathcal{F}$ is a set of halfspaces over a constant dimension space or a set of variable inequalities, our algorithm runs in polynomial time. The problem we address has a practical importance in the field of program synthesis, where the goal is to synthesize a program meeting some requirements. Program synthesis has become popular especially in settings aimed to help end users. In such settings, the requirements are not provided upfront and the synthesizer can only learn them by posing membership queries to the end user. Our work completes such synthesizers with the ability to learn the exact requirements while bounding the number of membership queries.

Rates of estimation for determinantal point processes

Sun, 18 Jun 2017 00:00:00 +0000

Determinantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. In this paper, we study the local geometry of the expected log-likelihood function to prove several rates of convergence for the MLE. We also give a complete characterization of the case where the MLE converges at a parametric rate. Even in the latter case, we also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE is exponentially large in the dimension of the problem.

Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo

Sun, 18 Jun 2017 00:00:00 +0000

This paper presents a detailed theoretical analysis of the Langevin Monte Carlo sampling algorithm recently introduced in Durmus et al. (Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau, 2016) when applied to log-concave probability distributions that are restricted to a convex body $K$. This method relies on a regularisation procedure involving the Moreau-Yosida envelope of the indicator function associated with $K$. Explicit convergence bounds in total variation norm and in Wasserstein distance of order $1$ are established. In particular, we show that the complexity of this algorithm given a first order oracle is polynomial in the dimension of the state space. Finally, some numerical experiments are presented to compare our method with competing MCMC approaches from the literature.

Efficient Co-Training of Linear Separators under Weak Dependence

Sun, 18 Jun 2017 00:00:00 +0000

We develop the first polynomial-time algorithm for co-training of homogeneous linear separators under \em weak dependence, a relaxation of the condition of independence given the label. Our algorithm learns from purely unlabeled data, except for a single labeled example to break symmetry of the two classes, and works for any data distribution having an inverse-polynomial margin and with center of mass at the origin.

The Sample Complexity of Optimizing a Convex Function

Sun, 18 Jun 2017 00:00:00 +0000

In this paper we study optimization from samples of convex functions. There are many scenarios in which we do not know the function we wish to optimize but can learn it from data. In such cases, we are interested in bounding the number of samples required to optimize the function. Our main result shows that in general, the number of samples required to obtain a non-trivial approximation to the optimum of a convex function is exponential in its dimension, even when the function is PAC-learnable. We also obtain strong lower bounds for strongly convex and Lipschitz continuous functions. On the positive side, we show that there are interesting classes of functions and distributions for which the sample complexity is polynomial in the dimension of the function.

Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems

Sun, 18 Jun 2017 00:00:00 +0000

Max-cut, clustering, and many other partitioning problems that are of significant importance to machine learning and other scientific fields are NP-hard, a reality that has motivated researchers to develop a wealth of approximation algorithms and heuristics. Although the best algorithm to use typically depends on the specific application domain, a worst-case analysis is often used to compare algorithms. This may be misleading if worst-case instances occur infrequently, and thus there is a demand for optimization methods which return the algorithm configuration best suited for the given application’s typical inputs. Recently, Gupta and Roughgarden introduced the first learning-theoretic framework to rigorously study this problem, using it to analyze classes of greedy heuristics, parameter tuning in gradient descent, and other problems. We study this algorithm configuration problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance. Our algorithms learn over common integer quadratic programming and clustering algorithm families: SDP rounding algorithms and agglomerative clustering algorithms with dynamic programming. For our sample complexity analysis, we provide tight bounds on the pseudodimension of these algorithm classes, and show that surprisingly, even for classes of algorithms parameterized by a single parameter, the pseudo-dimension is superconstant. In this way, our work both contributes to the foundations of algorithm configuration and pushes the boundaries of learning theory, since the algorithm classes we analyze consist of multi-stage optimization procedures and are significantly more complex than classes typically studied in learning theory.

Computationally Efficient Robust Sparse Estimation in High Dimensions

Sun, 18 Jun 2017 00:00:00 +0000

Many conventional statistical procedures are extremely sensitive to seemingly minor deviations from modeling assumptions. This problem is exacerbated in modern high-dimensional settings, where the problem dimension can grow with and possibly exceed the sample size. We consider the problem of robust estimation of sparse functionals, and provide a computationally and statistically efficient algorithm in the high-dimensional setting. Our theory identifies a unified set of deterministic conditions under which our algorithm guarantees accurate recovery. By further establishing that these deterministic conditions hold with high-probability for a wide range of statistical models, our theory applies to many problems of considerable interest including sparse mean and covariance estimation; sparse linear regression; and sparse generalized linear models. In certain settings, such as the detection and estimation of sparse principal components in the spiked covariance model, our general theory does not yield optimal sample complexity, and we provide a novel algorithm based on the same intuition which is able to take advantage of further structure of the problem to achieve nearly optimal rates.

The Price of Selection in Differential Privacy

Sun, 18 Jun 2017 00:00:00 +0000

In the differentially private top-$k$ selection problem, we are given a dataset $X ∈\pmo^n \times d$, in which each row belongs to an individual and each column corresponds to some binary attribute, and our goal is to find a set of $k ≪d$ columns whose means are approximately as large as possible. Differential privacy requires that our choice of these $k$ columns does not depend too much on any on individual’s dataset. This problem can be solved using the well known exponential mechanism and composition properties of differential privacy. In the high-accuracy regime, where we require the error of the selection procedure to be to be smaller than the so-called sampling error $α≈\sqrt\ln(d)/n$, this procedure succeeds given a dataset of size $n ≳k \ln(d)$. We prove a matching lower bound, showing that a dataset of size $n ≳k \ln(d)$ is necessary for private top-$k$ selection in this high-accuracy regime. Our lower bound shows that selecting the $k$ largest columns requires more data than simply estimating the value of those $k$ columns, which can be done using a dataset of size just $n ≳k$.

Efficient PAC Learning from the Crowd

Sun, 18 Jun 2017 00:00:00 +0000

In recent years crowdsourcing has become the method of choice for gathering labeled training data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring labeled data separately from the process of learning a classifier from the gathered data. This can give rise to computational and statistical challenges. For example, in most cases there are no known computationally efficient learning algorithms that are robust to the high level of noise that exists in crowdsourced data, and efforts to eliminate noise through voting often require a large number of queries per example. In this paper, we show how by interleaving the process of labeling and learning, we can attain computational efficiency with much less overhead in the labeling cost. In particular, we consider the \em realizable setting where there exists a true target function in $\mathcal{F}$ and consider a pool of labelers. When a noticeable fraction of the labelers are \emphperfect, and the rest behave arbitrarily, we show that any $\mathcal{F}$ that can be efficiently learned in the traditional \em realizable PAC model can be learned in a computationally efficient manner by querying the crowd, despite high amounts of noise in the responses. Moreover, we show that this can be done while each labeler only labels a constant number of examples and the number of labels requested per example, on average, is a constant. When no perfect labelers exist, a related task is to find a set of the labelers which are \emphgood but not perfect. We show that we can identify all good labelers, when at least the majority of labelers are good.

Correspondence retrieval

Sun, 18 Jun 2017 00:00:00 +0000

This article studies the correspondence retrieval problem: a set of $k$ distinct but unknown points $\mathbf{x}_1, \mathbf{x}_2, \dotsc, \mathbf{x}_k ∈\mathbb{R}^d$ are to be recovered from the unordered collection of projection values $⟨\mathbf{w}_i,\mathbf{x}_1 ⟩, ⟨\mathbf{w}_i,\mathbf{x}_2 ⟩, \dotsc, ⟨\mathbf{w}_i,\mathbf{x}_k ⟩$ onto $n$ known measurement vectors $\mathbf{w}_1, \mathbf{w}_2, \dotsc, \mathbf{w}_n$. Importantly, the correspondence of the $k$ projections ${⟨\mathbf{w}_i,\mathbf{x}_j ⟩}_j=1^k$ across different measurements is unknown. A special case of this problem is the well-studied problem of (real-valued) phase retrieval. In the case of independent standard Gaussian measurement vectors, the main algorithm proposed in this work requires $n = d+1$ measurements to correctly return the $k$ unknown points with high probability. This number of measurements is optimal, and it is smaller than the number of measurements required for a stronger “for all” guarantee even in the phase retrieval setting. The algorithm is based on reductions to the Shortest Vector Problem on certain random lattices, and employs the Lenstra, Lenstra, and Lovász (1982) basis reduction algorithm in a manner similar to the Lagarias & Odlyzko (1985) algorithm for solving random instances of Subset Sum. Another algorithm, essentially due to Yi, Caramanis, & Sanghavi (2016), based on higher-order moments and tensor decompositions is shown to work in a setting where the projection values are corrupted by additive Gaussian noise, but it requires a significantly larger number of measurements.

Homotopy Analysis for Tensor PCA

Sun, 18 Jun 2017 00:00:00 +0000

Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the “high noise” regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the \em best degree-$4$ sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.

Thompson Sampling for the MNL-Bandit

Sun, 18 Jun 2017 00:00:00 +0000

We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.

Learning with Limited Rounds of Adaptivity: Coin Tossing, Multi-Armed Bandits, and Ranking from Pairwise Comparisons

Sun, 18 Jun 2017 00:00:00 +0000

In many learning settings, active/adaptive querying is possible, but the number of rounds of adaptivity is limited. We study the relationship between query complexity and adaptivity in identifying the $k$ most biased coins among a set of $n$ coins with unknown biases. This problem is a common abstraction of many well-studied problems, including the problem of identifying the $k$ best arms in a stochastic multi-armed bandit, and the problem of top-$k$ ranking from pairwise comparisons. An $r$-round adaptive algorithm for the $k$ most biased coins problem specifies in each round the set of coin tosses to be performed based on the observed outcomes in earlier rounds, and outputs the set of $k$ most biased coins at the end of $r$ rounds. When $r=1$, the algorithm is known as \em non-adaptive; when $r$ is unbounded, the algorithm is known as \em fully adaptive. While the power of adaptivity in reducing query complexity is well known, full adaptivity requires repeated interaction with the coin tossing (feedback generation) mechanism, and is highly sequential, since the set of coins to be tossed in each round can only be determined after we have observed the outcomes of the coin tosses from the previous round. In contrast, algorithms with only few rounds of adaptivity require fewer rounds of interaction with the feedback generation mechanism, and offer the benefits of parallelism in algorithmic decision-making. Motivated by these considerations, we consider the question of how much adaptivity is needed to realize the optimal worst case query complexity for identifying the $k$ most biased coins. Given any positive integer $r$, we derive essentially matching upper and lower bounds on the query complexity of $r$-round algorithms. We then show that $Θ(\log^*n)$ rounds are both necessary and sufficient for achieving the optimal worst case query complexity for identifying the $k$ most biased coins. In particular, our algorithm achieves the optimal query complexity in at most $\log^*n$ rounds, which implies that on any realistic input, $5$ parallel rounds of exploration suffice to achieve the optimal worst-case sample complexity. The best known algorithm prior to our work required $Θ(\log n)$ rounds to achieve the optimal worst case query complexity even for the special case of $k=1$.

Corralling a Band of Bandit Algorithms

Sun, 18 Jun 2017 00:00:00 +0000

We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm \it if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works yielding superior regret bounds. Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.

Open Problem: First-Order Regret Bounds for Contextual Bandits

Sun, 18 Jun 2017 00:00:00 +0000

We describe two open problems related to first order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of $\tilde{\mathcal{O}}(\sqrt{L_⋆}K \ln N)$ where there are $K$ actions, $N$ policies, and $L_⋆$ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret $\tilde{\mathcal{O}}(L_⋆^{2/3}poly(K, \ln(N/δ)))$. We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.