Proceedings of Machine Learning Research

First-order Methods for Geodesically Convex Optimization

Mon, 06 Jun 2016 00:00:00 +0000

Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces. But unlike convex optimization, geodesically convex (g-convex) optimization is much less developed. In this paper we contribute to the understanding of g-convex optimization by developing iteration complexity analysis for several first-order algorithms on Hadamard manifolds. Specifically, we prove upper bounds for the global complexity of deterministic and stochastic (sub)gradient methods for optimizing smooth and nonsmooth g-convex functions, both with and without strong g-convexity. Our analysis also reveals how the manifold geometry, especially \emphsectional curvature, impacts convergence rates. To the best of our knowledge, our work is the first to provide global complexity analysis for first-order algorithms for general g-convex optimization.

The Extended Littlestone’s Dimension for Learning with Mistakes and Abstentions

Mon, 06 Jun 2016 00:00:00 +0000

This paper studies classification with an abstention option in the online setting. In this setting, examples arrive sequentially, the learner is given a hypothesis class \mathcalH, and the goal of the learner is to either predict a label on each example or abstain, while ensuring that it does not make more than a pre-specified number of mistakes when it does predict a label. Previous work on this problem has left open two main challenges. First, not much is known about the optimality of algorithms, and in particular, about what an optimal algorithmic strategy is for any individual hypothesis class. Second, while the realizable case has been studied, the more realistic non-realizable scenario is not well-understood. In this paper, we address both challenges. First, we provide a novel measure, called the Extended Littlestone’s Dimension, which captures the number of abstentions needed to ensure a certain number of mistakes. Second, we explore the non-realizable case, and provide upper and lower bounds on the number of abstentions required by an algorithm to guarantee a specified number of mistakes.

Online learning in repeated auctions

Mon, 06 Jun 2016 00:00:00 +0000

Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good’s value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical. Comparing our performance against that of the best fixed bid in hindsight, we show that sublinear regret is also achievable in this case. For both the stochastic and adversarial models, we prove matching minimax lower bounds showing our strategies to be optimal up to lower-order terms. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

A Guide to Learning Arithmetic Circuits

Mon, 06 Jun 2016 00:00:00 +0000

An \empharithmetic circuit is a directed acyclic graph in which the operations are {+,\times}. In this paper, we exhibit several connections between learning algorithms for arithmetic circuits and other problems. In particular, we show that: \beginitemize \item Efficient learning algorithms for arithmetic circuit classes imply explicit exponential lower bounds. \item General circuits and formulas can be learned efficiently with membership and equivalence queries iff they can be learned efficiently with membership queries only. \item Low-query learning algorithms for certain classes of circuits imply explicit rigid matrices. \item Learning algorithms for multilinear depth-3 and depth-4 circuits must compute square roots. \enditemize

benefits of depth in neural networks

Mon, 06 Jun 2016 00:00:00 +0000

For any positive integer k, there exist neural networks with Θ(k^3) layers, Θ(1) nodes per layer, and Θ(1) distinct parameters which can not be approximated by networks with O(k) layers unless they are exponentially large — they must possess Ω(2^k) nodes. This result is proved here for a class of nodes termed \emphsemi-algebraic gates which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees (in this last case with a stronger separation: Ω(2^k^3) total tree nodes are required).

Memory, Communication, and Statistical Queries

Mon, 06 Jun 2016 00:00:00 +0000

If a concept class can be represented with a certain amount of memory, can it be efficiently learned with the same amount of memory? What concepts can be efficiently learned by algorithms that extract only a few bits of information from each example? We introduce a formal framework for studying these questions, and investigate the relationship between the fundamental resources of memory or communication and the sample complexity of the learning task. We relate our memory-bounded and communication-bounded learning models to the well-studied statistical query model. This connection can be leveraged to obtain both upper and lower bounds: we show strong lower bounds on learning parity functions with bounded communication, as well as upper bounds on solving sparse linear regression problems with limited memory.

Best-of-K-bandits

Mon, 06 Jun 2016 00:00:00 +0000

This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain, favorable distributions because the influence of high-order order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these result with independent-arm lower bounds.

Interactive Algorithms: from Pool to Stream

Mon, 06 Jun 2016 00:00:00 +0000

We consider interactive algorithms in the pool-based setting, and in the stream-based setting. Interactive algorithms observe suggested elements (representing actions or queries), and interactively select some of them and receive responses. Pool-based algorithms can select elements at any order, while stream-based algorithms observe elements in sequence, and can only select elements immediately after observing them. We assume that the suggested elements are generated independently from some source distribution, and ask what is the stream size required for emulating a pool algorithm with a given pool size. We provide algorithms and matching lower bounds for general pool algorithms, and for utility-based pool algorithms. We further show that a maximal gap between the two settings exists also in the special case of active learning for binary classification.

Simple Bayesian Algorithms for Best Arm Identification

Mon, 06 Jun 2016 00:00:00 +0000

This paper considers the optimal adaptive allocation of measurement effort for identifying the best among a finite set of options or designs. An experimenter sequentially chooses designs to measure and observes noisy signals of their quality with the goal of confidently identifying the best design after a small number of measurements. I propose three simple Bayesian algorithms for adaptively allocating measurement effort. One is Top-Two Probability sampling, which computes the two designs with the highest posterior probability of being optimal, and then randomizes to select among these two. One is a variant a top-two sampling which considers not only the probability a design is optimal, but the expected amount by which its quality exceeds that of other designs. The final algorithm is a modified version of Thompson sampling that is tailored for identifying the best design. I prove that these simple algorithms satisfy a strong optimality property. In a frequestist setting where the true quality of the designs is fixed, one hopes the posterior definitively identifies the optimal design, in the sense that that the posterior probability assigned to the event that some other design is optimal converges to zero as measurements are collected. I show that under the proposed algorithms this convergence occurs at an \emphexponential rate, and the corresponding exponent is the best possible among all allocation rules.

How to calculate partition functions using convex programming hierarchies: provable bounds for variational methods

Mon, 06 Jun 2016 00:00:00 +0000

We consider the problem of approximating partition functions for Ising models. We make use of recent tools in combinatorial optimization: the Sherali-Adams and Lasserre convex programming hierarchies, in combination with variational methods to get algorithms for calculating partition functions in these families. These techniques give new, non-trivial approximation guarantees for the partition function beyond the regime of correlation decay. They also generalize some classical results from statistical physics about the Curie-Weiss ferromagnetic Ising model, as well as provide a partition function counterpart of classical results about max-cut on dense graphs (Arora, 1995). With this, we connect techniques from two apparently disparate research areas – optimization and counting/partition function approximations. (i.e. #-P type of problems). Furthermore, we design to the best of our knowledge the first provable, convex variational methods. Though in the literature there are a host of convex versions of variational methods, they come with no guarantees (apart from some extremely special cases, like e.g. the graph has a single cycle). We consider dense and low rank graphs, and interestingly, the reason our approach works on these types of graphs is because local correlations propagate to global correlations – completely the opposite of algorithms based on correlation decay. In the process we design novel entropy approximations based on the low-order moments of a distribution. Our proof techniques are very simple and generic, and likely to be applicable to many other settings other than Ising models.

When can we rank well from comparisons of O(n\log(n)) non-actively chosen pairs?

Mon, 06 Jun 2016 00:00:00 +0000

Ranking from pairwise comparisons is a ubiquitous problem and has been studied in disciplines ranging from statistics to operations research and from theoretical computer science to machine learning. Here we consider a general setting where outcomes of pairwise comparisons between items i and j are drawn probabilistically by flipping a coin with unknown bias P_ij , and ask under what conditions on these unknown probabilities one can learn a good ranking from comparisons of only O(n\log(n)) non-actively chosen pairs. Recent work has established this is possible under the Bradley-Terry-Luce (BTL) and noisy permutation (NP) models. Here we introduce a broad family of ‘low-rank’ conditions on the probabilities P_ij under which the resulting preference matrix P has low rank under some link function, and show these conditions encompass the BTL and Thurstone classes as special cases, but are considerably more general. We then give a new algorithm called low-rank pairwise ranking (LRPR) which provably learns a good ranking from comparisons of only O(n\log(n)) randomly chosen comparisons under such low-rank models. Our algorithm and analysis make use of tools from the theory of low-rank matrix completion, and provide a new perspective on the problem of ranking from pairwise comparisons in non-active settings.

Conference on Learning Theory 2016: Preface

Mon, 06 Jun 2016 00:00:00 +0000

Preface to COLT 2016

Cortical Computation via Iterative Constructions

Mon, 06 Jun 2016 00:00:00 +0000

We study Boolean functions of an arbitrary number of input variables that can be realized by simple iterative constructions based on constant-size primitives. This restricted type of construction needs little global coordination or control and thus is a candidate for neurally feasible computation. Valiant’s construction of a majority function can be realized in this manner and, as we show, can be generalized to any uniform threshold function. We study the rate of convergence, finding that while linear convergence to the correct function can be achieved for any threshold using a fixed set of primitives, for quadratic convergence, the size of the primitives must grow as the threshold approaches 0 or 1. We also study finite realizations of this process and the learnability of the functions realized. We show that the constructions realized are accurate outside a small interval near the target threshold, where the size of the construction grows as the inverse square of the interval width. This phenomenon, that errors are higher closer to thresholds (and thresholds closer to the boundary are harder to represent), is a well-known cognitive finding.

Open Problem: Parameter-Free and Scale-Free Online Algorithms

Mon, 06 Jun 2016 00:00:00 +0000

Existing vanilla algorithms for online linear optimization have O((ηR(u) + 1/η) \sqrtT) regret with respect to any competitor u, where R(u) is a 1-strongly convex regularizer and η> 0 is a tuning parameter of the algorithm. For certain decision sets and regularizers, the so-called \emphparameter-free algorithms have \widetilde O(\sqrtR(u) T) regret with respect to any competitor u. Vanilla algorithm can achieve the same bound only for a fixed competitor u known ahead of time by setting η= 1/\sqrtR(u). A drawback of both vanilla and parameter-free algorithms is that they assume that the norm of the loss vectors is bounded by a constant known to the algorithm. There exist \emphscale-free algorithms that have O((ηR(u) + 1/η) \sqrtT \max_1 \le t \le T \norm\ell_t) regret with respect to any competitor u and for any sequence of loss vector \ell_1, …, \ell_T. Parameter-free analogue of scale-free algorithms have never been designed. Is is possible to design algorithms that are simultaneously \emphparameter-free and \emphscale-free?

Density Evolution in the Degree-correlated Stochastic Block Model

Mon, 06 Jun 2016 00:00:00 +0000

There is a recent surge of interest in identifying the sharp recovery thresholds for cluster recovery under the stochastic block model. In this paper, we address the more refined question of how many vertices that will be misclassified on average. We consider the binary form of the stochastic block model, where n vertices are partitioned into two clusters with edge probability a/n within the first cluster, c/n within the second cluster, and b/n across clusters. Suppose that as n \to ∞, a= b+ μ\sqrt b , c=b+ ν\sqrt b for two fixed constants μ, ν, and b \to ∞with b=n^o(1). When the cluster sizes are balanced and μ≠ν, we show that the minimum fraction of misclassified vertices on average is given by Q(\sqrtv^*), where Q(x) is the Q-function for standard normal, v^* is the unique fixed point of v= \frac(μ-ν)^216 + \frac (μ+ν)^2 16 \mathbbE[ \tanh(v+ \sqrtv Z)], and Z is standard normal. Moreover, the minimum misclassified fraction on average is attained by a local algorithm, namely belief propagation, in time linear in the number of edges. Our proof techniques are based on connecting the cluster recovery problem to tree reconstruction problems, and analyzing the density evolution of belief propagation on trees with Gaussian approximations.

Learning Simple Auctions

Mon, 06 Jun 2016 00:00:00 +0000

We present a general framework for proving polynomial sample complexity bounds for the problem of learning from samples the best auction in a class of “simple” auctions. Our framework captures the most prominent examples of “simple” auctions, including anonymous and non-anonymous item and bundle pricings, with either a single or multiple buyers. The first step of the framework is to show that the set of auction allocation rules have a low-dimensional representation. The second step shows that, across the subset of auctions that share the same allocations on a given set of samples, the auction revenue varies in a low-dimensional way. Our results effectively imply that whenever it is possible to compute a near-optimal simple auction with a known prior, it is also possible to compute such an auction with an unknown prior, given a polynomial number of samples.

On the capacity of information processing systems

Mon, 06 Jun 2016 00:00:00 +0000

We propose and analyze a family of \emphinformation processing systems, where a finite set of experts or servers are employed to extract information about a stream of incoming jobs. Each job is associated with a hidden label drawn from some prior distribution. An inspection by an expert produces a noisy outcome that depends both on the job’s hidden label and the type of the expert, and occupies the expert for a finite time duration. A decision maker’s task is to dynamically assign inspections so that the resulting outcomes can be used to accurately recover the labels of all jobs, while keeping the system stable. Among our chief motivations are applications in crowd-sourcing, diagnostics, and experiment designs, where one wishes to efficiently discover the nature of a large number of items, using a finite pool of computational resources or human agents. We focus on the \emphcapacity of such an information processing system. Given a level of accuracy guarantee, we ask how many experts are needed in order to stabilize the system, and through what inspection architecture. Our main result provides an adaptive inspection policy that is asymptotically optimal in the following sense: the ratio between the required number of experts under our policy and the theoretical optimal converges to one, as the probability of error in label recovery tends to zero.

Learning Communities in the Presence of Errors

Mon, 06 Jun 2016 00:00:00 +0000

We study the problem of learning communities in the presence of modeling errors and give robust recovery algorithms for the Stochastic Block Model (SBM). This model, which is also known as the Planted Partition Model, is widely used for community detection and graph partitioning in various fields, including machine learning, statistics, and social sciences. Many algorithms exist for learning communities in the Stochastic Block Model, but they do not work well in the presence of errors. In this paper, we initiate the study of robust algorithms for partial recovery in SBM with modeling errors or noise. We consider graphs generated according to the Stochastic Block Model and then modified by an adversary. We allow two types of adversarial errors, Feige—Kilian or monotone errors, and edge outlier errors. Mossel, Neeman and Sly (STOC 2015) posed an open question about whether an almost exact recovery is possible when the adversary is allowed to add o(n) edges. Our work answers this question affirmatively even in the case of k>2 communities. We then show that our algorithms work not only when the instances come from SBM, but also work when the instances come from any distribution of graphs that is \varepsilon m close to SBM in the Kullback—Leibler divergence. This result also works in the presence of adversarial errors. Finally, we present almost tight lower bounds for two communities.

Gradient Descent Only Converges to Minimizers

Mon, 06 Jun 2016 00:00:00 +0000

We show that gradient descent converges to a local minimizer, almost surely with random initial- ization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory.

Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

Mon, 06 Jun 2016 00:00:00 +0000

I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Gittins index strategy is an improvement on existing algorithms with finite-time regret guarantees such as UCB and Thompson sampling.

Time series prediction and online learning

Mon, 06 Jun 2016 00:00:00 +0000

We present a series of theoretical and algorithmic results combining the benefits of the statistical learning approach to time series prediction with that of on-line learning. We prove new generalization guarantees for hypotheses derived from regret minimization algorithms in the general scenario where the data is generated by a non-stationary non-mixing stochastic process. Our theory enables us to derive model selection techniques with favorable theoretical guarantees in the scenario of time series, thereby solving a problem that is notoriously difficult in that scenario. It also helps us devise new ensemble methods with favorable theoretical guarantees for the task of forecasting non-stationary time series.

Online Isotonic Regression

Mon, 06 Jun 2016 00:00:00 +0000

We consider the online version of the isotonic regression problem. Given a set of linearly ordered points (e.g., on the real line), the learner must predict labels sequentially at adversarially chosen positions and is evaluated by her total squared loss compared against the best isotonic (non-decreasing) function in hindsight. We survey several standard online learning algorithms and show that none of them achieve the optimal regret exponent; in fact, most of them (including Online Gradient Descent, Follow the Leader and Exponential Weights) incur linear regret. We then prove that the Exponential Weights algorithm played over a covering net of isotonic functions has a regret bounded by O\big(T^1/3 \log^2/3(T)\big) and present a matching Ω(T^1/3) lower bound on regret. We provide a computationally efficient version of this algorithm. We also analyze the noise-free case, in which the revealed labels are isotonic, and show that the bound can be improved to O(\log T) or even to O(1) (when the labels are revealed in isotonic order). Finally, we extend the analysis beyond squared loss and give bounds for entropic loss and absolute loss.

Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja’s Algorithm

Mon, 06 Jun 2016 00:00:00 +0000

In this paper we provide improved guarantees for streaming principal component analysis (PCA). Given A_1, \ldots, A_n∈\mathbbR^d\times d sampled independently from distributions satisfying \mathbbE[A_i] = Σfor Σ\succeq 0, we present an O(d)-space linear-time single-pass streaming algorithm for estimating the top eigenvector of Σ. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance \frac1n \sum_i ∈[n] A_i as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of O(d) or 1/\mathrmgap where \mathrmgap is the relative distance between the top two eigenvalues of Σ. We achieve these results through a novel analysis of the classic Oja’s algorithm, one of the oldest and perhaps, most popular algorithms for streaming PCA. We show that simply picking a random initial point w_0 and applying the natural update rule w_i + 1 = w_i + \eta_i A_i w_i suffices for suitable choice of \eta_i. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.

Optimal rates for total variation denoising

Mon, 06 Jun 2016 00:00:00 +0000

Motivated by its practical success, we show that the 2D total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, Hölder smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.

Online Learning with Low Rank Experts

Mon, 06 Jun 2016 00:00:00 +0000

We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown d-dimensional subspace. We devise algorithms with regret bounds that are independent of the number of experts and depend only on the rank d. For the stochastic model we show a tight bound of Θ(\sqrtdT), and extend it to a setting of an approximate d subspace. For the adversarial model we show an upper bound of O(d\sqrtT) and a lower bound of Ω(\sqrtdT).

Semidefinite Programs for Exact Recovery of a Hidden Community

Mon, 06 Jun 2016 00:00:00 +0000

We study a semidefinite programming (SDP) relaxation of the maximum likelihood estimation for exactly recovering a hidden community of cardinality K from an n \times n symmetric data matrix A, where for distinct indices i,j, A_ij ∼P if i, j are both in the community and A_ij ∼Q otherwise, for two known probability distributions P and Q. We identify a sufficient condition and a necessary condition for the success of SDP for the general model. For both the Bernoulli case (P=\rm Bern(p) and Q=\rm Bern(q) with p>q) and the Gaussian case (P=\mathcalN(μ,1) and Q=\mathcalN(0,1) with μ>0), which correspond to the problem of planted dense subgraph recovery and submatrix localization respectively, the general results lead to the following findings: (1) If K=ω( n /\log n), SDP attains the information-theoretic recovery limits with sharp constants; (2) If K=Θ(n/\log n), SDP is order-wise optimal, but strictly suboptimal by a constant factor; (3) If K=o(n/\log n) and K \to ∞, SDP is order-wise suboptimal. The same critical scaling for K is found to hold, up to constant factors, for the performance of SDP on the stochastic block model of n vertices partitioned into multiple communities of equal size K. A key ingredient in the proof of the necessary condition is a construction of a primal feasible solution based on random perturbation of the true cluster matrix.

Maximin Action Identification: A New Bandit Framework for Games

Mon, 06 Jun 2016 00:00:00 +0000

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper- confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower bound analysis, and possible connections to an optimal algorithm.

Optimal Best Arm Identification with Fixed Confidence

Mon, 06 Jun 2016 00:00:00 +0000

We give a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the ‘Track-and-Stop’ strategy, which we prove to be asymptotically optimal. It consists in a new sampling rule (which tracks the optimal proportions of arm draws highlighted by the lower bound) and in a stopping rule named after Chernoff, for which we give a new analysis.

Preference-based Teaching

Mon, 06 Jun 2016 00:00:00 +0000

We introduce a new model of teaching named “preference-based teaching” and a corresponding complexity parameter—the preference-based teaching dimension (PBTD)—representing the worst-case number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on finite classes, it is radically different on infinite ones: the RTD becomes infinite already for trivial infinite classes (such as half-intervals) whereas the PBTD evaluates to reasonably small values for a wide collection of infinite classes including classes consisting of so-called closed sets w.r.t. a given closure operator, including various classes related to linear sets over \mathbbN_0 (whose RTD had been studied quite recently) and including the class of Euclidean half-spaces (and some other geometric classes). On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial flavor) which helps to derive bounds on the PBTD.

Open Problem: Property Elicitation and Elicitation Complexity

Mon, 06 Jun 2016 00:00:00 +0000

The study of property elicitation is gaining ground in statistics and machine learning as a way to view and reason about the expressive power of emiprical risk minimization (ERM). Yet beyond a widening frontier of special cases, the two most fundamental questions in this area remain open: which statistics are elicitable (computable via ERM), and which loss functions elicit them? Moreover, recent work suggests a complementary line of questioning: given a statistic, how many ERM parameters are needed to compute it? We give concrete instantiations of these important questions, which have numerous applications to machine learning and related fields.

Open Problem: Second order regret bounds based on scaling time

Mon, 06 Jun 2016 00:00:00 +0000

We argue that the second order bounds given in Cesa-Bianchi2006, which accumulate the square of the loss of each action separately, are loose. We propose a different form of a second order bound and conjecture the it is satisfied by NormalHedge ChaudhuriFrHs2009.

Online Sparse Linear Regression

Mon, 06 Jun 2016 00:00:00 +0000

We consider the online sparse linear regression problem, which is the problem of sequentially making predictions observing only a limited number of features in each round, to minimize regret with respect to the best sparse linear regressor, where prediction accuracy is measured by square loss. We give an \em inefficient algorithm that obtains regret bounded by \tildeO(\sqrtT) after T prediction rounds. We complement this result by showing that no algorithm running in polynomial time per iteration can achieve regret bounded by O(T^1-δ) for any constant δ> 0 unless \textsfNP ⊆\textsfBPP. This computational hardness result resolves an open problem presented in COLT 2014 (Kale, 2014) and also posed by Zolghadr et al. (2013). This hardness result holds even if the algorithm is allowed to access more features than the best sparse linear regressor up to a logarithmic factor in the dimension.

Spectral thresholds in the bipartite stochastic block model

Mon, 06 Jun 2016 00:00:00 +0000

We consider a bipartite stochastic block model on vertex sets V_1 and V_2, with planted partitions in each, and ask at what densities efficient algorithms can recover the partition of the smaller vertex set. When |V_2| ≫|V_1|, multiple thresholds emerge. We first locate a sharp threshold for detection of the partition, in the sense of the results of Mossel, Neeman and Sly and Massoulie for the stochastic block model. We then show that at a higher edge density, the singular vectors of the rectangular biadjacency matrix exhibit a localization / delocalization phase transition, giving recovery above the threshold and no recovery below. Nevertheless, we propose a simple spectral algorithm, Diagonal Deletion SVD, which recovers the partition at a nearly optimal edge density. The bipartite stochastic block model studied here was used by Feldman, Perkins, Vempala to give a unified algorithm for recovering planted partitions and assignments in random hypergraphs and random k-SAT formulae respectively. Our results give the best known bounds for the clause density at which solutions can be found efficiently in these models as well as showing a barrier to further improvement via this reduction to the bipartite block model.

Online Learning and Blackwell Approachability in Quitting Games

Mon, 06 Jun 2016 00:00:00 +0000

We consider the sequential decision problem known as regret minimization, or more precisely its generalization to the vectorial or multi-criteria setup called Blackwell approachability. We assume that Nature, the decision maker, or both, might have some quitting (or terminating) actions so that the stream of payoffs is constant whenever they are chosen. We call those environments “quitting games”. We characterize convex target sets \cC that are Blackwell approachable, in the sense that the decision maker has a policy ensuring that the expected average vector payoff converges to \cC at some given horizon known in advance. Moreover, we also compare these results to the cases where the horizon is not known and show that, unlike in standard online learning literature, the necessary or sufficient conditions for the anytime version of this problem are drastically different than those for the fixed horizon.

Open Problem: Kernel methods on manifolds and metric spaces. What is the probability of a positive definite geodesic exponential kernel?

Mon, 06 Jun 2016 00:00:00 +0000

Radial kernels are well-suited for machine learning over general geodesic metric spaces, where pairwise distances are often the only computable quantity available. We have recently shown that geodesic exponential kernels are only positive definite for all bandwidths when the input space has strong linear properties. This negative result hints that radial kernel are perhaps not suitable over geodesic metric spaces after all. Here, however, we present evidence that large intervals of bandwidths exist where geodesic exponential kernels have high probability of being positive definite over finite datasets, while still having significant predictive power. From this we formulate conjectures on the probability of a positive definite kernel matrix for a finite random sample, depending on the geometry of the data space and the spread of the sample.

The Power of Depth for Feedforward Neural Networks

Mon, 06 Jun 2016 00:00:00 +0000

We show that there is a simple (approximately radial) function on \mathbbR^d, expressible by a small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth – even if increased by 1 – can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

Asymptotic behavior of \ell_p-based Laplacian regularization in semi-supervised learning

Mon, 06 Jun 2016 00:00:00 +0000

Given a weighted graph with N vertices, consider a real-valued regression problem in a semi-supervised setting, where one observes n labeled vertices, and the task is to label the remaining ones. We present a theoretical study of \ell_p-based Laplacian regularization under a d-dimensional geometric random graph model. We provide a variational characterization of the performance of this regularized learner as N grows to infinity while n stays constant; the associated optimality conditions lead to a partial differential equation that must be satisfied by the associated function estimate \widehatf. From this formulation we derive several predictions on the limiting behavior the function \fhat, including (a) a phase transition in its smoothness at the threshold p = d + 1; and (b) a tradeoff between smoothness and sensitivity to the underlying unlabeled data distribution P. Thus, over the range p ≤d, the function estimate \widehatf is degenerate and “spiky,” whereas for p≥d+1, the function estimate \fhat is smooth. We show that the effect of the underlying density vanishes monotonically with p, such that in the limit p = ∞, corresponding to the so-called Absolutely Minimal Lipschitz Extension, the estimate \widehatf is independent of the distribution P. Under the assumption of semi-supervised smoothness, ignoring P can lead to poor statistical performance; in particular, we construct a specific example for d=1 to demonstrate that p=2 has lower risk than p=∞due to the former penalty adapting to P and the latter ignoring it. We also provide simulations that verify the accuracy of our predictions for finite sample sizes. Together, these properties show that p = d+1 is an optimal choice, yielding a function estimate \fhat that is both smooth and non-degenerate, while remaining maximally sensitive to P.

Properly Learning Poisson Binomial Distributions in Almost Polynomial Time

Mon, 06 Jun 2016 00:00:00 +0000

We give an algorithm for properly learning Poisson binomial distributions. A Poisson binomial distribution (PBD) of order n ∈\mathbbZ_+ is the discrete probability distribution of the sum of n mutually independent Bernoulli random variables. Given \widetildeO(1/ε^2) samples from an unknown PBD P, our algorithm runs in time (1/\eps)^O(\log \log (1/ε)), and outputs a hypothesis PBD that is ε-close to P in total variation distance. The sample complexity of our algorithm is known to be nearly-optimal, up to logarithmic factors, as established in previous work. However, the previously best known running time for properly learning PBDs was (1/ε)^O(\log(1/ε)), and was essentially obtained by enumeration over an appropriate ε-cover. We remark that the running time of this cover-based approach cannot be improved, as any ε-cover for the space of PBDs has size (1/ε)^Ω(\log(1/ε)). As one of our main contributions, we provide a novel structural characterization of PBDs, showing that any PBD P is ε-close to another PBD Q with O(\log(1/ε)) distinct parameters. More precisely, we prove that, for all ε>0, there exists an explicit collection \calM of (1/ε)^O(\log \log (1/ε)) vectors of multiplicities, such that for any PBD P there exists a PBD Q with O(\log(1/ε)) distinct parameters whose multiplicities are given by some element of \cal M, such that Q is ε-close to P. Our proof combines tools from Fourier analysis and algebraic geometry. Our approach to the proper learning problem is as follows: Starting with an accurate non-proper hypothesis, we fit a PBD to this hypothesis. This fitting problem can be formulated as a natural polynomial optimization problem. Our aforementioned structural characterization allows us to reduce the corresponding fitting problem to a collection of (1/ε)^O(\log \log(1/ε)) systems of low-degree polynomial inequalities. We show that each such system can be solved in time (1/ε)^O(\log \log(1/ε)), which yields the overall running time of our algorithm.

Optimal Learning via the Fourier Transform for Sums of Independent Integer Random Variables

Mon, 06 Jun 2016 00:00:00 +0000

We study the structure and learnability of sums of independent integer random variables (SIIRVs). For k ∈\mathbbZ_+, a \emk-SIIRV of order n ∈\mathbbZ_+ is the probability distribution of the sum of n mutually independent random variables each supported on {0, 1, …, k-1}. We denote by \cal S_n,k the set of all k-SIIRVs of order n. How many samples are required to learn an arbitrary distribution in \cal S_n,k? In this paper, we tightly characterize the sample and computational complexity of this problem. More precisely, we design a computationally efficient algorithm that uses \widetildeO(k/ε^2) samples, and learns an arbitrary k-SIIRV within error ε, in total variation distance. Moreover, we show that the \em optimal sample complexity of this learning problem is Θ((k/ε^2)\sqrt\log(1/ε)), i.e., we prove an upper bound and a matching information-theoretic lower bound. Our algorithm proceeds by learning the Fourier transform of the target k-SIIRV in its effective support. Its correctness relies on the \em approximate sparsity of the Fourier transform of k-SIIRVs – a structural property that we establish, roughly stating that the Fourier transform of k-SIIRVs has small magnitude outside a small set. Along the way we prove several new structural results about k-SIIRVs. As one of our main structural contributions, we give an efficient algorithm to construct a sparse \em proper ε-cover for \cal S_n,k, in total variation distance. We also obtain a novel geometric characterization of the space of k-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of ε-covers for \cal S_n,k – establishing that our cover upper bound is optimal – and is the key ingredient in our tight sample complexity lower bound. Our approach of exploiting the sparsity of the Fourier transform in distribution learning is general, and has recently found additional applications. In a subsequent work, we use a generalization of this idea to obtain the first computationally efficient learning algorithm for Poisson multinomial distributions. In a separate work, we build on our Fourier-based approach to obtain the fastest known proper learning algorithm for Poisson binomial distributions (2-SIIRVs).

Complexity Theoretic Limitations on Learning DNF’s

Mon, 06 Jun 2016 00:00:00 +0000

Using the recently developed framework of Daniely, Linial and Shalev-Shwartz, we show that under a natural assumption on the complexity of random K-SAT, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of various learning problems, including intersections of logarithmically many halfspaces, agnostically learning conjunctions, as well as virtually all (distribution free) learning problems that were previously shown hard (under various complexity assumptions).

Adaptive Learning with Robust Generalization Guarantees

Mon, 06 Jun 2016 00:00:00 +0000

The traditional notion of \emphgeneralization—i.e., learning a hypothesis whose empirical error is close to its true error—is surprisingly brittle. As has recently been noted [Dwork et al. 2015], even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this paper, we study three notions of generalization—increasing in strength—that are \emphrobust to postprocessing and amenable to adaptive composition, and examine the relationships between them. We call the weakest such notion \emphRobust Generalization. A second, intermediate, notion is the stability guarantee known as \emphdifferential privacy. The strongest guarantee we consider we call \emphPerfect Generalization. We prove that every hypothesis class that is PAC learnable is also PAC learnable in a robustly generalizing fashion, with almost the same sample complexity. It was previously known that differentially private algorithms satisfy robust generalization. In this paper, we show that robust generalization is a strictly weaker concept, and that there is a learning task that can be carried out subject to robust generalization guarantees, yet cannot be carried out subject to differential privacy. We also show that perfect generalization is a strictly stronger guarantee than differential privacy, but that, nevertheless, many learning tasks can be carried out subject to the guarantees of perfect generalization.

A Light Touch for Heavily Constrained SGD

Mon, 06 Jun 2016 00:00:00 +0000

Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints.

On the Expressive Power of Deep Learning: A Tensor Analysis

Mon, 06 Jun 2016 00:00:00 +0000

It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.

Provably manipulation-resistant reputation systems

Mon, 06 Jun 2016 00:00:00 +0000

Reputation and reliability play a central role in a wide range of applications, from online marketplaces to review aggregators to ridesharing services. Many reputation systems are vulnerable to manipulation, and protected only by keeping algorithms secret, avoiding high-stakes applications, or using side information to identify malicious users. The current situation is reminiscent of pre-modern cryptography, characterized by a patchwork of ad hoc techniques with minimal formal understanding of their security. We propose a reputation system which provably achieves a very strong correctness guarantee under extremely pessimistic assumptions—it works even given a supermajority of malicious users, converges to optimal behavior after a constant number of interactions per user, does not require repeated interactions, and accommodates time-varying quality of resources. Our formal model is simple but general. In each period, a user is given an opportunity to interact with a resource, and must accept or reject the proposed interaction. If they accept, they receive a payoff in [-1, 1]. Ideally all users would behave honestly, pooling their data and quickly learning which resources are worth interacting with. Our protocol essentially matches this performance when all users are honest, while guaranteeing that adding malicious users or users with varying tastes does very little damage. We also extend our results to a more challenging setting where users interact with each other rather than with static resources, and where the two parties to an interaction may receive different payoffs.

Open Problem: Best Arm Identification: Almost Instance-Wise Optimality and the Gap Entropy Conjecture

Mon, 06 Jun 2016 00:00:00 +0000

The best arm identification problem (BEST-1-ARM) is the most basic pure exploration problem in stochastic multi-armed bandits. The problem has a long history and attracted significant attention for the last decade. However, we do not yet have a complete understanding of the optimal sample complexity of the problem: The state-of-the-art algorithms achieve a sample complexity of O(\sum_i=2^n \Delta_i^-2(\lnδ^-1 + \ln\ln\Delta_i^-1)) (\Delta_i is the difference between the largest mean and the i^th mean), while the best known lower bound is Ω(\sum_i=2^n \Delta_i^-2\lnδ^-1) for general instances and Ω(∆^-2 \ln\ln ∆^-1) for the two-arm instances. We propose to study the instance-wise optimality for the BEST-1-ARM problem. Previous work has proved that it is impossible to have an instance optimal algorithm for the 2-arm problem. However, we conjecture that modulo the additive term Ω(\Delta_2^-2 \ln\ln \Delta_2^-1) (which is an upper bound and worst case lower bound for the 2-arm problem), there is an instance optimal algorithm for BEST-1-ARM. Moreover, we introduce a new quantity, called the gap entropy for a best-arm problem instance, and conjecture that it is the instance-wise lower bound. Hence, resolving this conjecture would provide a final answer to the old and basic problem.

Pure Exploration of Multi-armed Bandit Under Matroid Constraints

Mon, 06 Jun 2016 00:00:00 +0000

We study the pure exploration problem subject to a matroid constraint (Best-Basis) in a stochastic multi-armed bandit game. In a Best-Basis instance, we are given n stochastic arms with unknown reward distributions, as well as a matroid \mathcalM over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of \mathcalM with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-k arm identification problem, which have attracted significant attentions in recent years. We study both the exact and PAC versions of Best-Basis, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-k arm identification problem and the combinatorial pure exploration problem when the combinatorial constraint is a matroid.

On the Approximability of Sparse PCA

Mon, 06 Jun 2016 00:00:00 +0000

It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances. What is the complexity of solving Sparse PCA approximately? Our contributions include: \beginenumerate \item a simple and efficient algorithm that achieves an n^-1/3-approximation; \item NP-hardness of approximation to within (1-\varepsilon), for some small constant \varepsilon > 0; \item SSE-hardness of approximation to within \em any constant factor; and \item an \exp\exp\left(Ω\left(\sqrt\log \log n\right)\right) (“quasi-quasi-polynomial”) gap for the standard semidefinite program. \endenumerate

Delay and Cooperation in Nonstochastic Bandits

Mon, 06 Jun 2016 00:00:00 +0000

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order \sqrt\left(d+1 + \fracKN\alpha_≤d\right)(T\ln K), where \alpha_≤d is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=\sqrtK the regret bound is K^1/4\sqrtT, strictly better than the minimax regret \sqrtKT for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret \sqrtT\ln K when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.

Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem

Mon, 06 Jun 2016 00:00:00 +0000

We consider the problem of \textitbest arm identification with a \textitfixed budget T, in the K-armed stochastic bandit setting, with arms distribution defined on [0,1]. We prove that any bandit strategy, for at least one bandit problem characterized by a complexity H, will misidentify the best arm with probability lower bounded by $\exp\Big(-\frac{T}\log(K)H\Big)$, where $H$ is the sum for all sub-optimal arms of the inverse of the squared gaps. Our result disproves formally the general belief - coming from results in the fixed confidence setting - that there must exist an algorithm for this problem whose probability of error is upper bounded by $\exp(-T/H)$. This also proves that some existing strategies based on the Successive Rejection of the arms are optimal - closing therefore the current gap between upper and lower bounds for the fixed budget best arm identification problem.

Multi-scale exploration of convex functions and bandit convex optimization

Mon, 06 Jun 2016 00:00:00 +0000

We construct a new map from a convex function to a distribution on its domain, with the property that this distribution is a multi-scale exploration of the function. We use this map to solve a decade-old open problem in adversarial bandit convex optimization by showing that the minimax regret for this problem is \tildeO(\mathrmpoly(n) \sqrtT), where n is the dimension and T the number of rounds. This bound is obtained by studying the dual Bayesian maximin regret via the information ratio analysis of Russo and Van Roy, and then using the multi-scale exploration to construct a new algorithm for the Bayesian convex bandit problem.

Dropping Convexity for Faster Semi-definite Optimization

Mon, 06 Jun 2016 00:00:00 +0000

We study the minimization of a convex function f(X) over the set of n \times n positive semi-definite matrices, but when the problem is recast as \min_U g(U) := f(UU^⊤), with U ∈\mathbbR^n \times r and r ≤n. We study the performance of gradient descent on g—which we refer to as Factored Gradient Descent (\textscFgd)—under standard assumptions on the \em original function f. We provide a rule for selecting the step size and, with this choice, show that the \emphlocal convergence rate of \textscFgd mirrors that of standard gradient descent on the original f: \emphi.e., after k steps, the error is O(1/k) for smooth f, and exponentially small in k when f is (restricted) strongly convex. In addition, we provide a procedure to initialize \textscFgd for (restricted) strongly convex objectives and when one only has access to f via a first-order oracle; for several problem instances, such proper initialization leads to \emphglobal convergence guarantees. \textscFgd and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.

Aggregation of supports along the Lasso path

Mon, 06 Jun 2016 00:00:00 +0000

In linear regression with fixed design, we propose two procedures that aggregate a data-driven collection of supports. The collection is a subset of the 2^p possible supports and both its cardinality and its elements can depend on the data. The procedures satisfy oracle inequalities with no assumption on the design matrix. Then we use these procedures to aggregate the supports that appear on the regularization path of the Lasso in order to construct an estimator that mimics the best Lasso estimator. If the restricted eigenvalue condition on the design matrix is satisfied, then this estimator achieves optimal prediction bounds. Finally, we discuss the computational cost of these procedures.

Basis Learning as an Algorithmic Primitive

Mon, 06 Jun 2016 00:00:00 +0000

A number of important problems in theoretical computer science and machine learning can be interpreted as recovering a certain basis. These include symmetric matrix eigendecomposition, certain tensor decompositions, Independent Component Analysis (ICA), spectral clustering and Gaussian mixture learning. Each of these problems reduces to an instance of our general model, which we call a “Basis Encoding Function" (BEF). We show that learning a basis within this model can then be provably and efficiently achieved using a first order iteration algorithm (gradient iteration). Our algorithm goes beyond tensor methods while generalizing a number of existing algorithms—e.g., the power method for symmetric matrices, the tensor power iteration for orthogonal decomposable tensors, and cumulant-based FastICA—all within a broader function-based dynamical systems framework. Our framework also unifies the unusual phenomenon observed in these domains that they can be solved using efficient non-convex optimization. Specifically, we describe a class of BEFs such that their local maxima on the unit sphere are in one-to-one correspondence with the basis elements. This description relies on a certain “hidden convexity" property of these functions. We provide a complete theoretical analysis of the gradient iteration even when the BEF is perturbed. We show convergence and complexity bounds polynomial in dimension and other relevant parameters, such as perturbation size. Our perturbation results can be considered as a non-linear version of the classical Davis-Kahan theorem for perturbations of eigenvectors of symmetric matrices. In addition we show that our algorithm exhibits fast (superlinear) convergence and relate the speed of convergence to the properties of the BEF. Moreover, the gradient iteration algorithm can be easily and efficiently implemented in practice.

Noisy Tensor Completion via the Sum-of-Squares Hierarchy

Mon, 06 Jun 2016 00:00:00 +0000

In the noisy tensor completion problem we observe m entries (whose location is chosen uniformly at random) from an unknown n_1 \times n_2 \times n_3 tensor T. We assume that T is entry-wise close to being rank r. Our goal is to fill in its missing entries using as few observations as possible. Let n = \max(n_1, n_2, n_3). We show that if m = n^3/2 r then there is a polynomial time algorithm based on the sixth level of the sum-of-squares hierarchy for completing it. Our estimate agrees with almost all of T’s entries almost exactly and works even when our observations are corrupted by noise. This is also the first algorithm for tensor completion that works in the overcomplete case when r > n, and in fact it works all the way up to r = n^3/2-ε. Our proofs are short and simple and are based on establishing a new connection between noisy tensor completion (through the language of Rademacher complexity) and the task of refuting random constant satisfaction problems. This connection seems to have gone unnoticed even in the context of matrix completion. Furthermore, we use this connection to show matching lower bounds. Our main technical result is in characterizing the Rademacher complexity of the sequence of norms that arise in the sum-of-squares relaxations to the tensor nuclear norm. These results point to an interesting new direction: Can we explore computational vs. sample complexity tradeoffs through the sum-of-squares hierarchy?

Information-theoretic thresholds for community detection in sparse networks

Mon, 06 Jun 2016 00:00:00 +0000

We give upper and lower bounds on the information-theoretic threshold for community detection in the stochastic block model. Specifically, consider a symmetric stochastic block model with q groups, average degree d, and connection probabilities c_\mathrmin/n and c_\mathrmout/n for within-group and between-group edges respectively; let λ= (c_\mathrmin-c_\mathrmout)/(qd). We show that, when q is large, and λ= O(1/q), the critical value of d at which community detection becomes possible—in physical terms, the condensation threshold—is $ d_\mathrmc = Θ\left( \frac\log qq λ^2 \right) , with tighter results in certain regimes. Above this threshold, we show that any partition of the nodes into q groups which is as ‘good’ as the planted one, in terms of the number of within- and between-group edges, is correlated with it. This gives an exponential-time algorithm that performs better than chance; specifically, community detection becomes possible below the Kesten-Stigum bound for q \ge 5 in the disassortative case λ< 0, and for q \ge 11 in the assortative case λ> 0 (similar upper bounds were obtained independently by Abbe and Sandon). Conversely, below this threshold, we show that no algorithm can label the vertices better than chance, or even distinguish the block model from an Erdős-Rényi random graph with high probability. Our lower bound on d_\mathrmc uses Robinson and Wormald’s small subgraph conditioning method, and we also give (less explicit) results for non-symmetric stochastic block models. In the symmetric case, we obtain explicit results by using bounds on certain functions of doubly stochastic matrices due to Achlioptas and Naor; indeed, our lower bound on d_\mathrmc is their second moment lower bound on the q$-colorability threshold for random graphs with a certain effective degree.

On the low-rank approach for semidefinite programs arising in synchronization and community detection

Mon, 06 Jun 2016 00:00:00 +0000

To address difficult optimization problems, convex relaxations based on semidefinite programming are now common place in many fields. Although solvable in polynomial time, large semidefinite programs tend to be computationally challenging. Over a decade ago, exploiting the fact that in many applications of interest the desired solutions are low rank, Burer and Monteiro proposed a heuristic to solve such semidefinite programs by restricting the search space to low-rank matrices. The accompanying theory does not explain the extent of the empirical success. We focus on Synchronization and Community Detection problems and provide theoretical guarantees shedding light on the remarkable efficiency of this heuristic.

Instance-dependent Regret Bounds for Dueling Bandits

Mon, 06 Jun 2016 00:00:00 +0000

We study the multi-armed dueling bandit problem in which feedback is provided in the form of relative comparisons between pairs of actions, with the goal of eventually learning to select actions that are close to the best. Following Dudik et al. (2015), we aim for algorithms whose performance approaches that of the optimal randomized choice of actions, the von Neumann winner, expressly avoiding more restrictive assumptions, for instance, regarding the existence of a single best action (a Condorcet winner). In this general setting, the best known algorithms achieve regret O(\sqrtKT) in T rounds with K actions. In this paper, we present the first instance-dependent regret bounds for the general problem, focusing particularly on when the von Neumann winner is sparse. Specifically, we propose a new algorithm whose regret, relative to a unique von Neumann winner with sparsity s, is at most O(\sqrtsT), plus an instance-dependent constant. Thus, when the sparsity is much smaller than the total number of actions, our result indicates that learning can be substantially faster.

Learning Combinatorial Functions from Pairwise Comparisons

Mon, 06 Jun 2016 00:00:00 +0000

A large body of work in machine learning has focused on the problem of learning a close approximation to an underlying combinatorial function, given a small set of labeled examples. However, for real-valued functions, cardinal labels might not be accessible, or it may be difficult for an expert to consistently assign real-valued labels over the entire set of examples. For instance, it is notoriously hard for consumers to reliably assign values to bundles of merchandise. Instead, it might be much easier for a consumer to report which of two bundles she likes better. With this motivation in mind, we consider an alternative learning model, wherein the algorithm must learn the underlying function up to pairwise comparisons, from pairwise comparisons. In this model, we present a series of novel algorithms that learn over a wide variety of combinatorial function classes. These range from graph functions to broad classes of valuation functions that are fundamentally important in microeconomic theory, the analysis of social networks, and machine learning, such as coverage, submodular, XOS, and subadditive functions, as well as functions with sparse Fourier support.

An Improved Gap-Dependency Analysis of the Noisy Power Method

Mon, 06 Jun 2016 00:00:00 +0000

We consider the \emphnoisy power method algorithm, which has wide applications in machine learning and statistics, especially those related to principal component analysis (PCA) under resource (communication, memory or privacy) constraints. Existing analysis of the noisy power method shows an unsatisfactory dependency over the “consecutive" spectral gap (\sigma_k-\sigma_k+1) of an input data matrix, which could be very small and hence limits the algorithm’s applicability. In this paper, we present a new analysis of the noisy power method that achieves improved gap dependency for both sample complexity and noise tolerance bounds. More specifically, we improve the dependency over (\sigma_k-\sigma_k+1) to dependency over (\sigma_k-\sigma_q+1), where q is an intermediate algorithm parameter and could be much larger than the target rank k. Our proofs are built upon a novel characterization of proximity between two subspaces that differ from canonical angle characterizations analyzed in previous works. Finally, we apply our improved bounds to distributed private PCA and memory-efficient streaming PCA and obtain bounds that are superior to existing results in the literature.

Highly-Smooth Zero-th Order Online Optimization

Mon, 06 Jun 2016 00:00:00 +0000

The minimization of convex functions which are only available through partial and noisy information is a key methodological problem in many disciplines. In this paper we consider convex optimization with noisy zero-th order information, that is noisy function evaluations at any desired point. We focus on problems with high degrees of smoothness, such as logistic regression. We show that as opposed to gradient-based algorithms, high-order smoothness may be used to improve estimation rates, with a precise dependence of our upper-bounds on the degree of smoothness. In particular, we show that for infinitely differentiable functions, we recover the same dependence on sample size as gradient-based algorithms, with an extra dimension-dependent factor. This is done for both convex and strongly-convex functions, with finite horizon and anytime algorithms. Finally, we also recover similar results in the online optimization setting.

Open Problem: Approximate Planning of POMDPs in the class of Memoryless Policies

Mon, 06 Jun 2016 00:00:00 +0000

Planning plays an important role in the broad class of decision theory. Planning has drawn much attention in recent work in the robotics and sequential decision making areas. Recently, Reinforcement Learning (RL), as an agent-environment interaction problem, has brought further attention to planning methods. Generally in RL, one can assume a generative model, e.g. graphical models, for the environment, and then the task for the RL agent is to learn the model parameters and find the optimal strategy based on these learnt parameters. Based on environment behavior, the agent can assume various types of generative models, e.g. Multi Armed Bandit for a static environment, or Markov Decision Process (MDP) for a dynamic environment. The advantage of these popular models is their simplicity, which results in tractable methods of learning the parameters and finding the optimal policy. The drawback of these models is again their simplicity: these models usually underfit and underestimate the actual environment behavior. For example, in robotics, the agent usually has noisy observations of the environment inner state and MDP is not a suitable model. More complex models like Partially Observable Markov Decision Process (POMDP) can compensate for this drawback. Fitting this model to the environment, where the partial observation is given to the agent, generally gives dramatic performance improvement, sometimes unbounded improvement, compared to MDP. In general, finding the optimal policy for the POMDP model is computationally intractable and fully non convex, even for the class of memoryless policies. The open problem is to come up with a method to find an exact or an approximate optimal stochastic memoryless policy for POMDP models.

Reinforcement Learning of POMDPs using Spectral Methods

Mon, 06 Jun 2016 00:00:00 +0000

We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm running through episodes, in each episode we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound w.r.t. the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.

Learning and 1-bit Compressed Sensing under Asymmetric Noise

Mon, 06 Jun 2016 00:00:00 +0000

We study the \emphapproximate recovery problem: Given corrupted 1-bit measurements of the form sign(w^* ⋅x_i), recover a vector w that is a good approximation to w^* ∈\Re^d. This problem has been studied by both the learning theory and signal processing communities. In learning theory, this is known as the problem of \emphlearning halfspaces with noise, and in signal processing, as \emph1-bit compressed sensing, in which there is an additional assumption that w^* is t-sparse. The challenge in both cases is to design computationally efficient algorithms that are tolerant to large amounts of noise under realistic noise models. Furthermore, in the case of 1-bit compressed sensing, we require the number of measurements x_i to scale polynomially in t and only polylogarithmically in d, the ambient dimension. In this work, we introduce algorithms with nearly optimal guarantees for both problems under two realistic noise models, \emphbounded (Massart) noise and \emphadversarial (agnostic) noise, when the measurements x_i’s are drawn from any isotropic log-concave distribution. In bounded (Massart) noise, an adversary can flip the measurement of each point x with probability η(x)≤η< 1/2. For this problem, we present an efficient algorithm that returns w such that \|w- w^*\|_2 ≤εin time poly(d, \frac 1 ε) for \emphany constant η< 1/2. This improves significantly over the best known result of Awasthi et al. 2015, in this space that required the noise to be as small as η≈10^-6. We then introduce an attribute-efficient variant of this algorithm for 1-bit compressed sensing that achieves the same guarantee with poly(t, \log(d), \frac 1 ε) measurements when \|w^*\|_0≤t. For adversarial (agnostic) noise, where any νfraction of measurements can be corrupted, we provide an algorithm that returns w such that \|w-w^*\|_2 ≤O(ν) + ε, with \tildeΩ( \frac t ε^3 \polylog(d)) measurements. Our results improve on the best known approximation results in this space and under some regimes improve on the sample complexity of the existing results. Furthermore, this is the first result of its kind in 1-bit compressed sensing that goes beyond the Gaussian marginal distribution and works for any isotrpic log-concave distribution.

Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models

Mon, 06 Jun 2016 00:00:00 +0000

In this paper we study a model-based approach to calculating approximately optimal policies in Markovian Decision Processes. In particular, we derive novel bounds on the loss of using a policy derived from a factored linear model, a class of models which generalize numerous previous models out of those that come with strong computational guarantees. For the first time in the literature, we derive performance bounds for model-based techniques where the model inaccuracy is measured in weighted norms. Moreover, our bounds show a decreased sensitivity to the discount factor and, unlike similar bounds derived for other approaches, they are insensitive to measure mismatch. Similarly to previous works, our proofs are also based on contraction arguments, but with the main differences that we use carefully constructed norms building on Banach lattices, and the contraction property is only assumed for operators acting on “compressed” spaces, thus weakening previous assumptions, while strengthening previous results.

An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits

Mon, 06 Jun 2016 00:00:00 +0000

We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is O(K\sqrtn \log n) and against stochastic bandits the pseudo-regret is O(\sum_i (\log n)/\Delta_i). We also show that no algorithm with O(\log n) pseudo-regret against stochastic bandits can achieve \tildeO(\sqrtn) expected regret against adaptive adversarial bandits. This complements previous results of Bubeck and Slivkins (2012) that show \tildeO(\sqrtn) expected adversarial regret with O((\log n)^2) stochastic pseudo-regret.

Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes

Mon, 06 Jun 2016 00:00:00 +0000

Strongly Rayleigh distributions are natural generalizations of product and determinantal probability distributions and satisfy the strongest form of negative dependence properties. We show that the "natural" Monte Carlo Markov Chain (MCMC) algorithm mixes rapidly in the support of a homogeneous strongly Rayleigh distribution. As a byproduct, our proof implies Markov chains can be used to efficiently generate approximate samples of a k-determinantal point process. This answers an open question raised by Deshpande and Rademacher which was studied recently by Kang, Li-Jegelka-Sra, and Rebeschini-Karbasi.

Efficient approaches for escaping higher order saddle points in non-convex optimization

Mon, 06 Jun 2016 00:00:00 +0000

Local search heuristics for non-convex optimizations are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a \em local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have \em degenerate saddle points such that the first and second order derivatives cannot distinguish them with local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NP-hard to extend this further to finding fourth order local optima.

Sign rank versus VC dimension

Mon, 06 Jun 2016 00:00:00 +0000

This work studies the maximum possible sign rank of N \times N sign matrices with a given VC dimension d. For d=1, this maximum is three. For d=2, this maximum is \tildeΘ(N^1/2). For d >2, similar but slightly less accurate statements hold. Our lower bounds improve over previous ones by Ben-David et al. and can be interpreted as exhibiting a weakness of kernel-based classifiers. Our upper bounds, on the other hand, can be interpreted as exhibiting the universality of kernel-based classifiers. The lower bounds are obtained by probabilistic constructions, using a theorem of Warren in real algebraic topology. The upper bounds are obtained using a result of Welzl about spanning trees with low stabbing number, and using the moment curve. The upper bound technique is also used to: (i) provide estimates on the number of classes of a given VC dimension, and the number of maximum classes of a given VC dimension – answering a question of Frankl from ’89, and (ii) design an efficient algorithm that provides an O(N/\log(N)) multiplicative approximation for the sign rank (computing the sign rank is equivalent to the existential theory of the reals). We also observe a general connection between sign rank and spectral gaps which is based on Forster’s argument. Consider the N \times N adjacency matrix of a ∆regular graph with a second eigenvalue of absolute value λand ∆≤N/2. We show that the sign rank of the signed version of this matrix is at least ∆/λ. We use this connection to prove the existence of a maximum class C⊆{\pm 1}^N with VC dimension 2 and sign rank \tildeΘ(N^1/2). This answers a question of Ben-David et al. regarding the sign rank of large VC classes. We also describe limitations of this approach, in the spirit of the Alon-Boppana theorem. We further describe connections to communication complexity, geometry, learning theory, and combinatorics.

Learning and Testing Junta Distributions

Mon, 06 Jun 2016 00:00:00 +0000

We consider the problem of learning distributions in the presence of irrelevant features. This problem is formalized by introducing a new notion of \emphk-junta distributions. Informally, a distribution \mathcalD over the domain \mathcalX^n is a \emphk-junta distribution with respect to another distribution \mathcalU over the same domain if there is a set J ⊆[n] of size |J| \le k that captures the difference between \mathcal D and \mathcalU. We show that it is possible to learn k-junta distributions with respect to the uniform distribution over the Boolean hypercube {0,1}^n in time \poly(n^k, 1/ε). This result is obtained via a new Fourier-based learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan (1993). We also consider the problem of testing whether an unknown distribution is a k-junta distribution with respect to the uniform distribution. We give a nearly-optimal algorithm for this task. Both the analysis of the algorithm and the lower bound showing its optimality are obtained by establishing connections between the problem of testing junta distributions and testing uniformity of weighted collections of distributions.

An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

Mon, 06 Jun 2016 00:00:00 +0000

We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al., who gave a computationally inefficient algorithm with near-optimal regret bounds for it. We give a \emphcomputationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Dudik et al. for the non-constrained version of the problem. The computational time of our algorithm scales \emphlogarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors.