Proceedings of Machine Learning Research

Lower Bounds on the Performance of Polynomial-time Algorithms for Sparse Linear Regression

Thu, 29 May 2014 00:00:00 +0000

Under a standard assumption in complexity theory (NP not in P/poly), we demonstrate a gap between the minimax prediction risk for sparse linear regression that can be achieved by polynomial-time algorithms, and that achieved by optimal algorithms. In particular, when the design matrix is ill-conditioned, the minimax prediction loss achievable by polynomial-time algorithms can be substantially greater than that of an optimal algorithm. This result is the first known gap between polynomial and optimal algorithms for sparse linear regression, and does not depend on conjectures in average-case complexity.

Community Detection via Random and Adaptive Sampling

Thu, 29 May 2014 00:00:00 +0000

In this paper, we consider networks consisting of a finite number of non-overlapping communities. To extract these communities, the interaction between pairs of nodes may be sampled from a large available data set, which allows a given node pair to be sampled several times. When a node pair is sampled, the observed outcome is a binary random variable, equal to 1 if nodes interact and to 0 otherwise. The outcome is more likely to be positive if nodes belong to the same communities. For a given budget of node pair samples or observations, we wish to jointly design a sampling strategy (the sequence of sampled node pairs) and a clustering algorithm that recover the hidden communities with the highest possible accuracy. We consider both non-adaptive and adaptive sampling strategies, and for both classes of strategies, we derive fundamental performance limits satisfied by any sampling and clustering algorithm. In particular, we provide necessary conditions for the existence of algorithms recovering the communities accurately as the network size grows large. We also devise simple algorithms that accurately reconstruct the communities when this is at all possible, hence proving that the proposed necessary conditions for accurate community detection are also sufficient. The classical problem of community detection in the stochastic block model can be seen as a particular instance of the problems consider here. But our framework covers more general scenarios where the sequence of sampled node pairs can be designed in an adaptive manner. The paper provides new results for the stochastic block model, and extends the analysis to the case of adaptive sampling.

Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

Thu, 29 May 2014 00:00:00 +0000

The classical setting of community detection consists of networks exhibiting a clustered structure. To more accurately model real systems we consider a class of networks (i) whose edges may carry labels and (ii) which may lack a clustered structure. Specifically we assume that nodes possess latent attributes drawn from a general compact space and edges between two nodes are randomly generated and labeled according to some unknown distribution as a function of their latent attributes. Our goal is then to infer the edge label distributions from a partially observed network. We propose a computationally efficient spectral algorithm and show it allows for asymptotically correct inference when the average node degree could be as low as logarithmic in the total number of nodes. Conversely, if the average node degree is below a specific constant threshold, we show that no algorithm can achieve better inference than guessing without using the observations. As a byproduct of our analysis, we show that our model provides a general procedure to construct random graph models with a spectrum asymptotic to a pre-specified eigenvalue distribution such as a power-law distribution.

The Geometry of Losses

Thu, 29 May 2014 00:00:00 +0000

Loss functions are central to machine learning because they are the means by which the quality of a prediction is evaluated. Any loss that is not proper, or can not be transformed to be proper via a link function is inadmissible. All admissible losses for n-class problems can be obtained in terms of a convex body in \mathbbR^n. We show this explicitly and show how some existing results simplify when viewed from this perspective. This allows the development of a rich algebra of losses induced by binary operations on convex bodies (that return a convex body). Furthermore it allows us to define an “inverse loss” which provides a universal “substitution function” for the Aggregating Algorithm. In doing so we show a formal connection between proper losses and norms.

Open Problem: Shifting Experts on Easy Data

Thu, 29 May 2014 00:00:00 +0000

A number of online algorithms have been developed that have small additional loss (regret) compared to the best “shifting expert”. In this model, there is a set of experts and the comparator is the best partition of the trial sequence into a small number of segments, where the expert of smallest loss is chosen in each segment. The regret is typically defined for worst-case data / loss sequences. There has been a recent surge of interest in online algorithms that combine good worst-case guarantees with much improved performance on easy data. A practically relevant class of easy data is the case when the loss of each expert is iid and the best and second best experts have a gap between their mean loss. In the full information setting, the FlipFlop algorithm by De Rooij et al. (2014) combines the best of the iid optimal Follow-The-Leader (FL) and the worst-case-safe Hedge algorithms, whereas in the bandit information case SAO by Bubeck and Slivkins (2012) competes with the iid optimal UCB and the worst-case-safe EXP3. We ask the same question for the shifting expert problem. First, we ask what are the simple and efficient algorithms for the shifting experts problem when the loss sequence in each segment is iid with respect to a fixed but unknown distribution. Second, we ask how to efficiently unite the performance of such algorithms on easy data with worst-case robustness. A particular intriguing open problem is the case when the comparator shifts within a small subset of experts from a large set under the assumption that the losses in each segment are iid.

Follow the Leader with Dropout Perturbations

Thu, 29 May 2014 00:00:00 +0000

We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.

Localized Complexities for Transductive Learning

Thu, 29 May 2014 00:00:00 +0000

We show two novel concentration inequalities for suprema of empirical processes when sampling without replacement, which both take the variance of the functions into account. While these inequalities may potentially have broad applications in learning theory in general, we exemplify their significance by studying the transductive setting of learning theory. For which we provide the first excess risk bounds based on the localized complexity of the hypothesis class, which can yield fast rates of convergence also in the transductive learning setting. We give a preliminary analysis of the localized complexities for the prominent case of kernel classes.

Elicitation and Identification of Properties

Thu, 29 May 2014 00:00:00 +0000

Properties of distributions are real-valued functionals such as the mean, quantile or conditional value at risk. A property is elicitable if there exists a scoring function such that minimization of the associated risks recovers the property. We extend existing results to characterize the elicitability of properties in a general setting. We further relate elicitability to identifiability (a notion introduced by Osband) and provide a general formula describing all scoring functions for an elicitable property. Finally, we draw some connections to the theory of coherent risk measures.

Sample Compression for Multi-label Concept Classes

Thu, 29 May 2014 00:00:00 +0000

This paper studies labeled sample compression for multi-label concept classes. For a specific extension of the notion of VC-dimension to multi-label classes, we prove that every maximum multi-label class of dimension d has a sample compression scheme in which every sample is compressed to a subset of size at most d. We further show that every multi-label class of dimension 1 has a sample compression scheme using only sets of size at most 1. As opposed to the binary case, the latter result is not immediately implied by the former, since there are multi-label concept classes of dimension 1 that are not contained in maximum classes of dimension 1.

On the Consistency of Output Code Based Learning Algorithms for Multiclass Learning Problems

Thu, 29 May 2014 00:00:00 +0000

A popular approach to solving multiclass learning problems is to reduce them to a set of binary classification problems through some output code matrix: the widely used one-vs-all and all-pairs methods, and the error-correcting output code methods of Dietterich and Bakiri (1995), can all be viewed as special cases of this approach. In this paper, we consider the question of statistical consistency of such methods. We focus on settings where the binary problems are solved by minimizing a binary surrogate loss, and derive general conditions on the binary surrogate loss under which the one-vs-all and all-pairs code matrices yield consistent algorithms with respect to the multiclass 0-1 loss. We then consider general multiclass learning problems defined by a general multiclass loss, and derive conditions on the output code matrix and binary surrogates under which the resulting algorithm is consistent with respect to the target multiclass loss. We also consider \emphprobabilistic code matrices, where one reduces a multiclass problem to a set of \emphclass probability labeled binary problems, and show that these can yield benefits in the sense of requiring a smaller number of binary problems to achieve overall consistency. Our analysis makes interesting connections with the theory of proper composite losses (Buja et al., 2005; Reid and Williamson, 2010); these play a role in constructing the right ‘decoding’ for converting the predictions on the binary problems to the final multiclass prediction. To our knowledge, this is the first work that comprehensively studies consistency properties of output code based methods for multiclass learning.

Online Non-Parametric Regression

Thu, 29 May 2014 00:00:00 +0000

We establish optimal rates for online regression for arbitrary classes of regression functions in terms of the sequential entropy introduced in (Rakhlin et al., 2010). The optimal rates are shown to exhibit a phase transition analogous to the i.i.d./statistical learning case, studied in (Rakhlin et al., 2014b). In the frequently encountered situation when sequential entropy and i.i.d. empirical entropy match, our results point to the interesting phenomenon that the rates for statistical learning with squared loss and online nonparametric regression are the same. In addition to a non-algorithmic study of minimax regret, we exhibit a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online regression algorithms that can be computationally efficient. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

Belief propagation, robust reconstruction and optimal recovery of block models

Thu, 29 May 2014 00:00:00 +0000

We consider the problem of reconstructing sparse symmetric block models with two blocks and connection probabilities a/n and b/n for inter- and intra-block edge probabilities respectively. It was recently shown that one can do better than a random guess if and only if (a-b)^2 > 2(a+b). Using a variant of Belief Propagation, we give a reconstruction algorithm that is \emphoptimal in the sense that if (a-b)^2 > C (a+b) for some constant C then our algorithm maximizes the fraction of the nodes labelled correctly. Along the way we prove some results of independent interest regarding \em robust reconstruction for the Ising model on regular and Poisson trees.

Bayes-Optimal Scorers for Bipartite Ranking

Thu, 29 May 2014 00:00:00 +0000

We address the following seemingly simple question: what is the Bayes-optimal scorer for a bipartite ranking risk? The answer to this question helps establish the consistency of the minimisation of surrogate bipartite risks, and elucidates the relationship between bipartite ranking and other established learning problems. We show that the answer is non-trivial in general, but may be easily determined for certain special cases using the theory of proper losses. Our analysis immediately establishes equivalences between several seemingly disparate risks for bipartite ranking, such as minimising a suitable class-probability estimation risk, and minimising the p-norm push risk proposed by Rudin (2009).

Learning without concentration

Thu, 29 May 2014 00:00:00 +0000

We obtain sharp bounds on the convergence rate of Empirical Risk Minimization performed in a convex class and with respect to the squared loss, without any boundedness assumptions on class members or on the target. Rather than resorting to a concentration-based argument, the method relies on a ‘small-ball’ assumption and thus holds for heavy-tailed sampling and heavy-tailed targets. Moreover, the resulting estimates scale correctly with the ‘noise level’ of the problem. When applied to the classical, bounded scenario, the method always improves the known estimates.

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Thu, 29 May 2014 00:00:00 +0000

We study algorithms for online linear optimization in Hilbert spaces, focusing on the case where the player is unconstrained. We develop a novel characterization of a large class of minimax algorithms, recovering, and even improving, several previous results as immediate corollaries. Moreover, using our tools, we develop an algorithm that provides a regret bound of O(U \sqrtT \log( U \sqrtT \log^2 T +1)), where U is the L_2 norm of an arbitrary comparator and both T and U are unknown to the player. This bound is optimal up to \sqrt\log \log T terms. When T is known, we derive an algorithm with an optimal regret bound (up to constant factors). For both the known and unknown T case, a Normal approximation to the conditional value of the game proves to be the key analysis tool.

An Inequality with Applications to Structured Sparsity and Multitask Dictionary Learning

Thu, 29 May 2014 00:00:00 +0000

From concentration inequalities for the suprema of Gaussian or Rademacher processes an inequality is derived. It is applied to sharpen existing and to derive novel bounds on the empirical Rademacher complexities of unit balls in various norms appearing in the context of structured sparsity and multitask dictionary learning or matrix factorization. A key role is played by the largest eigenvalue of the data covariance matrix.

Approachability in unknown games: Online learning meets multi-objective optimization

Thu, 29 May 2014 00:00:00 +0000

In the standard setting of approachability there are two players and a target set. The players play a repeated vector-valued game where one of them wants to have the average vector-valued payoff converge to the target set which the other player tries to exclude. We revisit the classical setting and consider the setting where the player has a preference relation between target sets: she wishes to approach the smallest (“best”) set possible given the observed average payoffs in hindsight. Moreover, as opposed to previous works on approachability, and in the spirit of online learning, we do not assume that there is a known game structure with actions for two players. Rather, the player receives an arbitrary vector-valued reward vector at every round. We show that it is impossible, in general, to approach the best target set in hindsight. We further propose a concrete strategy that approaches a non-trivial relaxation of the best-in-hindsight given the actual rewards. Our approach does not require projection onto a target set and amounts to switching between scalar regret minimization algorithms that are performed in episodes.

Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms

Thu, 29 May 2014 00:00:00 +0000

We consider stochastic multi-armed bandit problems where the expected reward is a Lipschitz function of the arm, and where the set of arms is either discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic problem specific lower bounds for the regret satisfied by any algorithm, and propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz structure of the problem. In fact, we prove that OSLB is asymptotically optimal, as its asymptotic regret matches the lower bound. The regret analysis of our algorithms relies on a new concentration inequality for weighted sums of KL divergences between the empirical distributions of rewards and their true distributions. For continuous Lipschitz bandits, we propose to first discretize the action space, and then apply OSLB or CKL-UCB, algorithms that provably exploit the structure efficiently. This approach is shown, through numerical experiments, to significantly outperform existing algorithms that directly deal with the continuous set of arms. Finally the results and algorithms are extended to contextual bandits with similarities.

Most Correlated Arms Identification

Thu, 29 May 2014 00:00:00 +0000

We study the problem of finding the most mutually correlated arms among many arms. We show that adaptive arms sampling strategies can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator. The use of this accurate estimator allows us to get improved results for a wide range of problem instances.

Compressed Counting Meets Compressed Sensing

Thu, 29 May 2014 00:00:00 +0000

Compressed sensing (sparse signal recovery) has been a popular and important research topic in recent years. By observing that natural signals (e.g., images or network data) are often nonnegative, we propose a framework for nonnegative signal recovery using \em Compressed Counting (CC). CC is a technique built on \em maximally-skewed α-stable random projections originally developed for data stream computations (e.g., entropy estimations). Our recovery procedure is computationally efficient in that it requires only one linear scan of the coordinates. In our settings, the signal \mathbfx∈\mathbbR^N is assumed to be nonnegative, i.e., x_i≥0, ∀i. We prove that, when α∈(0, 0.5], it suffices to use M=(C_α+o(1)) ε^-α \left(\sum_i=1^N x_i^α\right)\log N/δmeasurements so that, with probability 1-δ, all coordinates will be recovered within εadditive precision, in one scan of the coordinates. The constant C_α=1 when α\rightarrow0 and C_α=\pi/2 when α=0.5. In particular, when α\rightarrow0, the required number of measurements is essentially M=K\log N/δ, where K = \sum_i=1^N 1{x_i≠0} is the number of nonzero coordinates of the signal.

Uniqueness of Ordinal Embedding

Thu, 29 May 2014 00:00:00 +0000

Ordinal embedding refers to the following problem: all we know about an unknown set of points x_1,\ldots, x_n ∈\mathbbR^d are ordinal constraints of the form \|x_i - x_j\| < \|x_k - x_l\|; the task is to construct a realization y_1,\ldots, y_n ∈\mathbbR^d that preserves these ordinal constraints. It has been conjectured since the 1960ies that upon knowledge of all ordinal constraints a large but finite set of points can be approximately reconstructed up to a similarity transformation. The main result of our paper is a formal proof of this conjecture.

Open Problem: A (Missing) Boosting-type Convergence Result for AdaBoost.MH with Factorized Multi-class Classifiers

Thu, 29 May 2014 00:00:00 +0000

In (Kégl, 2014), we recently showed empirically that AdaBoost.MH is one of the best multi-class boosting algorithms when the classical one-against-all base classifiers, proposed in the seminal paper of Schapire and Singer (1999), are replaced by factorized base classifiers containing a binary classifier and a vote (or code) vector. In a slightly different setup, a similar factorization coupled with an iterative optimization of the two factors also proved to be an excellent approach (Gao and Koller, 2011). The main algorithmic advantage of our approach over the original setup of Schapire and Singer (1999) is that trees can be built in a straightforward way by using the binary classifier at inner nodes. In this open problem paper we take a step back to the basic setup of boosting generic multi-class factorized (Hamming) classifiers (so no trees), and state the classical problem of boosting-like convergence of the training error. Given a vote vector, training the classifier leads to a standard weighted binary classification problem. The main difficulty of proving the convergence is that, unlike in binary AdaBoost, the sum of the weights in this weighted binary classification problem is less than one, which means that the lower bound on the edge, coming from the weak learning condition, shrinks. To show the convergence, we need a (uniform) lower bound on the sum of the weights in this derived binary classification problem.

On the Complexity of A/B Testing

Thu, 29 May 2014 00:00:00 +0000

A/B testing refers to the task of determining the best option among two alternatives that yield random outcomes. We provide distribution-dependent lower bounds for the performance of A/B testing that improve over the results currently available both in the fixed-confidence (or δ-PAC) and fixed-budget settings. When the distribution of the outcomes are Gaussian, we prove that the complexity of the fixed-confidence and fixed-budget settings are equivalent, and that uniform sampling of both alternatives is optimal only in the case of equal variances. In the common variance case, we also provide a stopping rule that terminates faster than existing fixed-confidence algorithms. In the case of Bernoulli distributions, we show that the complexity of fixed-budget setting is smaller than that of fixed-confidence setting and that uniform sampling of both alternatives—though not optimal—is advisable in practice when combined with an appropriate stopping criterion.

Principal Component Analysis and Higher Correlations for Distributed Data

Thu, 29 May 2014 00:00:00 +0000

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix A=A^1 + A^2 + \ldots + A^s, with matrix A^t stored on server t and (2) computing a function of a vector a_1 + a_2 + \ldots + a_s, where server t has the vector a_t; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication, and in particular the only dependence on n, the size of the data, is in the number of bits needed to represent indices and words (O(\log n)).

Distribution-independent Reliable Learning

Thu, 29 May 2014 00:00:00 +0000

We study several questions in the \emphreliable agnostic learning framework of Kalai et al. (2009), which captures learning tasks in which one type of error is costlier than other types. A positive reliable classifier is one that makes no false positive errors. The goal in the \emphpositive reliable agnostic framework is to output a hypothesis with the following properties: (i) its false positive error rate is at most ε, (ii) its false negative error rate is at most εmore than that of the best positive reliable classifier from the class. A closely related notion is \emphfully reliable agnostic learning, which considers \emphpartial classifiers that are allowed to predict “unknown” on some inputs. The best fully reliable partial classifier is one that makes no errors and minimizes the probability of predicting “unknown”, and the goal in fully reliable learning is to output a hypothesis that is almost as good as the best fully reliable partial classifier from a class. For distribution-independent learning, the best known algorithms for PAC learning typically utilize polynomial threshold representations, while the state of the art agnostic learning algorithms use point-wise polynomial approximations. We show that \emphone-sided polynomial approximations, an intermediate notion between polynomial threshold representations and point-wise polynomial approximations, suffice for learning in the reliable agnostic settings. We then show that majorities can be fully reliably learned and disjunctions of majorities can be positive reliably learned, through constructions of appropriate one-sided polynomial approximations. Our fully reliable algorithm for majorities provides the first evidence that fully reliable learning may be strictly easier than agnostic learning. Our algorithms also satisfy strong attribute-efficiency properties, and in many cases they provide smooth tradeoffs between sample complexity and running time.

Open Problem: Efficient Online Sparse Regression

Thu, 29 May 2014 00:00:00 +0000

In practical scenarios, it is often necessary to be able to make predictions with very limited access to the features of any example. We provide one natural formulation as an online sparse regression problem with squared loss, and ask whether it is possible to achieve sublinear regret with efficient algorithms (i.e. polynomial running time in the natural parameters of the problem).

Multiarmed Bandits With Limited Expert Advice

Thu, 29 May 2014 00:00:00 +0000

We consider the problem of minimizing regret in the setting of advice-efficient multiarmed bandits with expert advice. We give an algorithm for the setting of K arms and N experts out of which we are allowed to query and use only M experts’ advice in each round, which has a regret bound of \tildeO\left(\sqrt\frac\min{K, M} NM T\right) after T rounds. We also prove that any algorithm for this problem must have expected regret at least \tildeΩ\left(\sqrt\frac\min{K, M} NMT\right), thus showing that our upper bound is nearly tight. This solves the COLT 2013 open problem of Seldin et al. (2013).

lil’ UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits

Thu, 29 May 2014 00:00:00 +0000

The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.

Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

Thu, 29 May 2014 00:00:00 +0000

We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowdsourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of mixtures is finite or have sample/time complexity that is exponential in the number of mixtures. In this paper, we introduce a polynomial time/sample complexity method for learning a mixture of r discrete product distributions over {1, 2, …, \ell}^n, for general \ell and r. We show that our approach is consistent and further provide finite sample guarantees. We use recently developed techniques from tensor decompositions for moment matching. A crucial step in these approaches is to construct certain tensors with low-rank spectral decompositions. These tensors are typically estimated from the sample moments. The main challenge in learning mixtures of discrete product distributions is that the corresponding low-rank tensors cannot be obtained directly from the sample moments. Instead, we need to estimate a low-rank matrix using only off-diagonal entries, and estimate a tensor using a few linear measurements. We give an alternating minimization based method to estimate the low-rank matrix, and formulate the tensor estimation problem as a least-squares problem.

Volumetric Spanners: an Efficient Exploration Basis for Learning

Thu, 29 May 2014 00:00:00 +0000

Numerous machine learning problems require an \it exploration basis - a mechanism to explore the action space. We define a novel geometric notion of exploration basis with low variance called volumetric spanners, and give efficient algorithms to construct such bases. We show how efficient volumetric spanners give rise to an efficient and near-optimal regret algorithm for bandit linear optimization over general convex sets. Previously such results were known only for specific convex sets, or under special conditions such as the existence of an efficient self-concordant barrier for the underlying set.

Logistic Regression: Tight Bounds for Stochastic and Online Optimization

Thu, 29 May 2014 00:00:00 +0000

The logistic loss function is often advocated in machine learning and statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this paper we investigate the question of whether these smoothness and convexity properties make the logistic loss preferable to other widely considered options such as the hinge loss. We show that in contrast to known asymptotic bounds, as long as the number of prediction/optimization iterations is sub exponential, the logistic loss provides no improvement over a generic non-smooth loss function such as the hinge loss. In particular we show that the convergence rate of stochastic logistic optimization is bounded from below by a polynomial in the diameter of the decision set and the number of prediction iterations, and provide a matching tight upper bound. This resolves the COLT open problem of McMahan and Streeter (2012).

Near-Optimal Herding

Thu, 29 May 2014 00:00:00 +0000

Herding is an algorithm of recent interest in the machine learning community, motivated by inference in Markov random fields. It solves the following sampling problem: given a set \mathcalX ⊂\mathbbR^d with mean μ, construct an infinite sequence of points from \mathcalX such that, for every t ≥1, the mean of the first t points in that sequence lies within Euclidean distance O(1/t) of μ. The classic Perceptron boundedness theorem implies that such a result actually holds for a wide class of algorithms, although the factors suppressed by the O(1/t) notation are exponential in d. Thus, to establish a non-trivial result for the sampling problem, one must carefully analyze the factors suppressed by the O(1/t) error bound. This paper studies the best error that can be achieved for the sampling problem. Known analysis of the Herding algorithm give an error bound that depends on geometric properties of \mathcalX but, even under favorable conditions, this bound depends linearly on d. We present a new polynomial-time algorithm that solves the sampling problem with error O\left(\sqrtd \log^2.5|\mathcalX| / t \right) assuming that \mathcalX is finite. Our algorithm is based on recent algorithmic results in \textitdiscrepancy theory. We also show that any algorithm for the sampling problem must have error Ω( \sqrtd / t ). This implies that our algorithm is optimal to within logarithmic factors.

Computational Limits for Matrix Completion

Thu, 29 May 2014 00:00:00 +0000

Matrix Completion is the problem of recovering an unknown real-valued low-rank matrix from a subsample of its entries. Important recent results show that the problem can be solved efficiently under the assumption that the unknown matrix is incoherent and the subsample is drawn uniformly at random. Are these assumptions necessary? It is well known that Matrix Completion in its full generality is NP-hard. However, little is known if we make additional assumptions such as incoherence and permit the algorithm to output a matrix of slightly higher rank. In this paper we prove that Matrix Completion remains computationally intractable even if the unknown matrix has rank 4 but we are allowed to output any constant rank matrix, and even if additionally we assume that the unknown matrix is incoherent and are shown 90% of the entries. This result relies on the conjectured hardness of the 4-Coloring problem. We also consider the positive semidefinite Matrix Completion problem. Here we show a similar hardness result under the standard assumption that \mathrmP\ne \mathrmNP. Our results greatly narrow the gap between existing feasibility results and computational lower bounds. In particular, we believe that our results give the first complexity-theoretic justification for why distributional assumptions are needed beyond the incoherence assumption in order to obtain positive results. On the technical side, we contribute several new ideas on how to encode hard combinatorial problems in low-rank optimization problems. We hope that these techniques will be helpful in further understanding the computational limits of Matrix Completion and related problems.

Fast matrix completion without the condition number

Thu, 29 May 2014 00:00:00 +0000

We give the first algorithm for Matrix Completion that achieves running time and sample complexity that is polynomial in the rank of the unknown target matrix, \emphlinear in the dimension of the matrix, and \emphlogarithmic in the condition number of the matrix. To the best of our knowledge, all previous algorithms either incurred a quadratic dependence on the condition number of the unknown matrix or a quadratic dependence on the dimension of the matrix. Our algorithm is based on a novel extension of Alternating Minimization which we show has theoretical guarantees under standard assumptions even in the presence of noise.

Stochastic Regret Minimization via Thompson Sampling

Thu, 29 May 2014 00:00:00 +0000

The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has largest mean reward conditioned on the current posterior distributions of the arms. This policy generalizes the celebrated “probability matching” heuristic which has been experimentally and widely observed in human decision making. However, despite its ubiquity, the Thompson Sampling policy is poorly understood. Our goal in this paper is to make progress towards understanding the empirical success of this policy. We proceed using the lens of approximation algorithms and problem definitions from stochastic optimization. We focus on an objective function termed \em stochastic regret that captures the expected number of times the policy plays an arm that is not the eventual best arm, where the expectation is over the prior distribution. Given such a definition, we show that TS is a 2–approximation to the optimal decision policy in two extreme but canonical scenarios. One such scenario is the two-armed bandit problem which is used as a calibration point in all bandit literature. The second scenario is stochastic optimization where the outcome of a random variable is revealed in a single play to a high or low deterministic value. We show that the 2 approximation is tight in both these scenarios. We provide an uniform analysis framework that in theory is capable of proving our conjecture that the TS policy is a 2–approximation to the optimal decision policy for minimizing stochastic regret, for any prior distribution and any time horizon.

Open Problem: Finding Good Cascade Sampling Processes for the Network Inference Problem

Thu, 29 May 2014 00:00:00 +0000

Information spreads across social and technological networks, but often the network structures are hidden and we only observe the traces left by the diffusion processes, called cascades. It is known that, under a popular continuous-time diffusion model, as long as the model parameters satisfy a natural incoherence condition, it is possible to recover the correct network structure with high probability if we observe O(d^3 \log N) cascades, where d is the maximum number of parents of a node and N is the total number of nodes. However, the incoherence condition depends, in a non-trivial way, on the source (node) distribution of the cascades, which is typically unknown. Our open problem is whether it is possible to design an active algorithm which samples the source locations in a sequential manner and achieves the same or even better sample complexity, e.g., o(d_i^3 \log N), than previous work.

Higher-Order Regret Bounds with Switching Costs

Thu, 29 May 2014 00:00:00 +0000

This work examines online linear optimization with full information and switching costs (SCs) and focuses on regret bounds that depend on properties of the loss sequences. The SCs considered are bounded functions of a pair of decisions, and regret is augmented with the total SC. We show under general conditions that for any normed SC, σ(\mathbfx,\mathbfx’)=\|\mathbfx-\mathbfx’\|, regret \textitcannot be bounded given only a bound Q on the quadratic variation of losses. With an additional bound Λon the total length of losses, we prove O(\sqrtQ+Λ) regret for Regularized Follow the Leader (RFTL). Furthermore, an O(\sqrtQ) bound holds for RFTL given a cost \|\mathbfx-\mathbfx’\|^2. By generalizing the Shrinking Dartboard algorithm, we also show an expected regret bound for the best expert setting with any SC, given bounds on the total loss of the best expert and the quadratic variation of any expert. As SCs vanish, all our bounds depend purely on quadratic variation. We apply our results to pricing options in an arbitrage-free market with proportional transaction costs. In particular, we upper bound the price of “at the money” call options, assuming bounds on the quadratic variation of a stock price and the minimum of summed gains and summed losses.

A second-order bound with excess losses

Thu, 29 May 2014 00:00:00 +0000

We study online aggregation of the predictions of experts, and first show new second-order regret bounds in the standard setting, which are obtained via a version of the Prod algorithm (and also a version of the polynomially weighted average algorithm) with multiple learning rates. These bounds are in terms of excess losses, the differences between the instantaneous losses suffered by the algorithm and the ones of a given expert. We then demonstrate the interest of these bounds in the context of experts that report their confidences as a number in the interval [0,1] using a generic reduction to the standard setting. We conclude by two other applications in the standard setting, which improve the known bounds in case of small excess losses and show a bounded regret against i.i.d. sequences of losses.

Open Problem: The Statistical Query Complexity of Learning Sparse Halfspaces

Thu, 29 May 2014 00:00:00 +0000

We consider the long-open problem of attribute-efficient learning of halfspaces. In this problem the learner is given random examples labeled by an unknown halfspace function f on \mathbbR^n. Further f is r-sparse, that is it depends on at most r out of n variables. An attribute-efficient learning algorithm is an algorithm that can output a hypothesis close to f using a polynomial in r and \log n number of examples (Blum, 1992). Despite a number of attempts and some partial progress, there are no efficient algorithms or hardness results for the problem. We propose a potentially easier question: what is the query complexity of this learning problem in the statistical query (SQ) model of Kearns (1998). We show that, as in the case of general PAC learning, the query complexity of attribute-efficient SQ learning of any concept class can be characterized by a combinatorial parameter of the concept class. The proposed question is then equivalent to estimating the value of this parameter for the concept class of halfspaces. A potentially simpler problem is to estimate this parameter for the concept class of decision lists, a subclass of halfspaces.

Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

Thu, 29 May 2014 00:00:00 +0000

In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008) but a number of basic questions still remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013a,b). Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or \mathrmSCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds: \beginitemize \item \mathrmSCDP(C) = Ω(\mathrmLDim(C)), where \mathrmLDim(C) is the Littlestone’s (1987) dimension characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that \mathrmSCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work. \item For any t, there exists a class C such that \mathrmLDim(C)=2 but \mathrmSCDP(C) ≥t. \item For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(\log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b). \enditemize We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al., (2013a) and our bounds rely on a number of known results from communication complexity.

Learning Coverage Functions and Private Release of Marginals

Thu, 29 May 2014 00:00:00 +0000

We study the problem of approximating and learning coverage functions. A function c: 2^[n] →\mathbfR^+ is a coverage function, if there exists a universe U with non-negative weights w(u) for each u ∈U and subsets A_1, A_2, \ldots, A_n of U such that c(S) = \sum_u ∈\cup_i ∈S A_i w(u). Alternatively, coverage functions can be described as non-negative linear combinations of monotone disjunctions. They are a natural subclass of submodular functions and arise in a number of applications. We give an algorithm that for any γ,δ>0, given random and uniform examples of an unknown coverage function c, finds a function h that approximates c within factor 1+γon all but δ-fraction of the points in time poly(n,1/γ,1/δ). This is the first fully-polynomial algorithm for learning an interesting class of functions in the demanding PMAC model of Balcan and Harvey (2011). Our algorithms are based on several new structural properties of coverage functions. Using the results in (Feldman and Kothari, 2014), we also show that coverage functions are learnable agnostically with excess \ell_1-error εover all product and symmetric distributions in time n^\log(1/ε). In contrast, we show that, without assumptions on the distribution, learning coverage functions is at least as hard as learning polynomial-size disjoint DNF formulas, a class of functions for which the best known algorithm runs in time 2^\tildeO(n^1/3) (Klivans and Servedio, 2004). As an application of our learning results, we give simple differentially-private algorithms for releasing monotone conjunction counting queries with low \em average error. In particular, for any k ≤n, we obtain private release of k-way marginals with average error \barα in time n^O(\log(1/\barα)).

Online Learning with Composite Loss Functions

Thu, 29 May 2014 00:00:00 +0000

We study a new class of online learning problems where each of the online algorithm’s actions is assigned an adversarial value, and the loss of the algorithm at each step is a known and deterministic function of the values assigned to its recent actions. This class includes problems where the algorithm’s loss is the \emphminimum over the recent adversarial values, the \emphmaximum over the recent values, or a \emphlinear combination of the recent values. We analyze the minimax regret of this class of problems when the algorithm receives bandit feedback, and prove that when the \emphminimum or \emphmaximum functions are used, the minimax regret is \widetilde Ω(T^2/3) (so called \emphhard online learning problems), and when a linear function is used, the minimax regret is \widetilde O(\sqrtT) (so called \empheasy learning problems). Previously, the only online learning problem that was known to be provably hard was the multi-armed bandit with switching costs.

Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians

Thu, 29 May 2014 00:00:00 +0000

We provide an algorithm for properly learning mixtures of two single-dimensional Gaussians without any separability assumptions. Given \tildeO(1/\varepsilon^2) samples from an unknown mixture, our algorithm outputs a mixture that is \varepsilon-close in total variation distance, in time \tildeO(1/\varepsilon^5). Our sample complexity is optimal up to logarithmic factors, and significantly improves upon both Kalai et al., whose algorithm has a prohibitive dependence on 1/\varepsilon, and Feldman et al., whose algorithm requires bounds on the mixture parameters and depends pseudo-polynomially in these parameters. One of our main contributions is an improved and generalized algorithm for selecting a good candidate distribution from among competing hypotheses. Namely, given a collection of N hypotheses containing at least one candidate that is \varepsilon-close to an unknown distribution, our algorithm outputs a candidate which is O(\varepsilon)-close to the distribution. The algorithm requires O(\logN/\varepsilon^2) samples from the unknown distribution and O(N \log N/\varepsilon^2) time, which improves previous such results (such as the Scheffé estimator) from a quadratic dependence of the running time on N to quasilinear. Given the wide use of such results for the purpose of hypothesis selection, our improved algorithm implies immediate improvements to any such use.

Optimal learners for multiclass problems

Thu, 29 May 2014 00:00:00 +0000

The fundamental theorem of statistical learning states that for \emphbinary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek for a generic optimal learner for \emphmulticlass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be \emphimproper, namely, it must have the ability to output hypotheses which do not belong to the hypothesis class, even though it knows that all the labels are generated by some hypothesis from the class. In particular, no ERM learner is optimal. This brings back the fundamental question of “how to learn”? We give a complete answer to this question by giving a new analysis of the one-inclusion multiclass learner of Rubinstein et el (2006) showing that its sample complexity is essentially optimal. Then, we turn to study the popular hypothesis class of generalized linear classifiers. We derive optimal learners that, unlike the one-inclusion algorithm, are computationally efficient. Furthermore, we show that the sample complexity of these learners is better than the sample complexity of the ERM rule, thus settling in negative an open question due to Collins (2005)

The Complexity of Learning Halfspaces using Generalized Linear Methods

Thu, 29 May 2014 00:00:00 +0000

Many popular learning algorithms (E.g. Regression, Fourier-Transform based algorithms, Kernel SVM and Kernel ridge regression) operate by reducing the problem to a convex optimization problem over a set of functions. These methods offer the currently best approach to several central problems such as learning half spaces and learning DNF’s. In addition they are widely used in numerous application domains. Despite their importance, there are still very few proof techniques to show limits on the power of these algorithms. We study the performance of this approach in the problem of (agnostically and improperly) learning halfspaces with margin γ. Let D be a distribution over labeled examples. The γ-margin error of a hyperplane h is the probability of an example to fall on the wrong side of h or at a distance \leγfrom it. The γ-margin error of the best h is denoted \mathrmErr_γ(D). An α(γ)-approximation algorithm receives γ,εas input and, using i.i.d. samples of D, outputs a classifier with error rate \le α(γ)\mathrmErr_γ(D) + ε. Such an algorithm is efficient if it uses \mathrmpoly(\frac1γ,\frac1ε) samples and runs in time polynomial in the sample size. The best approximation ratio achievable by an efficient algorithm is O\left(\frac1/γ\sqrt\log(1/γ)\right) and is achieved using an algorithm from the above class. Our main result shows that the approximation ratio of every efficient algorithm from this family must be \ge Ω\left(\frac1/γ\mathrmpoly\left(\log\left(1/γ\right)\right)\right), essentially matching the best known upper bound.

Open Problem: Online Local Learning

Thu, 29 May 2014 00:00:00 +0000

In many learning problems, we attempt to infer \emphglobal structure in the interest of making \emphlocal predictions. For example, we might try to infer the skills of the competitors in a tournament in order to predict who will win a match, or we might try to predict characteristics of users and films in order to predict which users will like which films. In even relatively simple settings of this type, it is typically NP-hard to find the latent data which best explain some observations. But do these complexity-theoretic obstructions actually prevent us from making good predictions? Because each prediction depends on only a small number of variables, it might be possible to make good predictions without actually finding a good global assignment. This may seem to be a purely technical distinction, but recent work has shown that several local prediction problems actually \emphare easy even though the corresponding global inference problem is hard. The question we pose is: how general is this phenomenon?

A Convex Formulation for Mixed Regression with Two Components: Minimax Optimal Rates

Thu, 29 May 2014 00:00:00 +0000

We consider the mixed regression problem with two components, under adversarial and stochastic noise. We give a convex optimization formulation that provably recovers the true solution, and provide upper bounds on the recovery errors for both arbitrary noise and stochastic noise settings. We also give matching minimax lower bounds (up to log factors), showing that under certain assumptions, our algorithm is information-theoretically optimal. Our results represent the first (and currently only known) tractable algorithm guaranteeing successful recovery with tight bounds on recovery errors and sample complexity.

Finding a most biased coin with fewest flips

Thu, 29 May 2014 00:00:00 +0000

We study the problem of learning a most biased coin among a set of coins by tossing the coins adaptively. The goal is to minimize the number of tosses until we identify a coin whose posterior probability of being most biased is at least 1-δfor a given δ. Under a particular probabilistic model, we give an optimal algorithm, i.e., an algorithm that minimizes the expected number of future tosses. The problem is closely related to finding the best arm in the multi-armed bandit problem using adaptive strategies. Our algorithm employs an optimal adaptive strategy—a strategy that performs the best possible action at each step after observing the outcomes of all previous coin tosses. Consequently, our algorithm is also optimal for any given starting history of outcomes. To our knowledge, this is the first algorithm that employs an optimal adaptive strategy under a Bayesian setting for this problem. Our proof of optimality employs mathematical tools from the area of Markov games.

Efficiency of conformalized ridge regression

Thu, 29 May 2014 00:00:00 +0000

Conformal prediction is a method of producing prediction sets that can be applied on top of a wide range of prediction algorithms. The method has a guaranteed coverage probability under the standard IID assumption regardless of whether the assumptions (often considerably more restrictive) of the underlying algorithm are satisfied. However, for the method to be really useful it is desirable that in the case where the assumptions of the underlying algorithm are satisfied, the conformal predictor loses little in efficiency as compared with the underlying algorithm (whereas being a conformal predictor, it has the stronger guarantee of validity). In this paper we explore the degree to which this additional requirement of efficiency is satisfied in the case of Bayesian ridge regression; we find that asymptotically conformal prediction sets differ little from ridge regression prediction intervals when the standard Bayesian assumptions are satisfied.

Open Problem: Tensor Decompositions: Algorithms up to the Uniqueness Threshold?

Thu, 29 May 2014 00:00:00 +0000

Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability

Thu, 29 May 2014 00:00:00 +0000

We give a robust version of the celebrated result of Kruskal on the uniqueness of tensor decompositions: given a tensor whose decomposition satisfies a robust form of Kruskal’s rank condition, we prove that it is possible to approximately recover the decomposition if the tensor is known up to a sufficiently small (inverse polynomial) error. Kruskal’s theorem has found many applications in proving the identifiability of parameters for various latent variable models and mixture models such as Hidden Markov models, topic models etc. Our robust version immediately implies identifiability using only polynomially many samples in many of these settings – an essential first step towards efficient learning algorithms. Our methods also apply to the “overcomplete” case, which has proved challenging in many applications. Given the importance of Kruskal’s theorem in the tensor literature, we expect that our robust version will have several applications beyond the settings we explore in this work.

The sample complexity of agnostic learning under deterministic labels

Thu, 29 May 2014 00:00:00 +0000

With the emergence of Machine Learning tools that allow handling data with a huge number of features, it becomes reasonable to assume that, over the full set of features, the true labeling is (almost) fully determined. That is, the labeling function is deterministic, but not necessarily a member of some known hypothesis class. However, agnostic learning of deterministic labels has so far received little research attention. We investigate this setting and show that it displays a behavior that is quite different from that of the fundamental results of the common (PAC) learning setups. First, we show that the sample complexity of learning a binary hypothesis class (with respect to deterministic labeling functions) is not fully determined by the VC-dimension of the class. For any d, we present classes of VC-dimension d that are learnable from \tilde O(d/ε)-many samples and classes that require samples of size Ω(d/ε^2). Furthermore, we show that in this setup, there are classes for which any proper learner has suboptimal sample complexity. While the class can be learned with sample complexity \tilde O(d/ε), any \emphproper (and therefore, any ERM) algorithm requires Ω(d/ε^2) samples. We provide combinatorial characterizations of both phenomena, and further analyze the utility of unlabeled samples in this setting. Lastly, we discuss the error rates of nearest neighbor algorithms under deterministic labels and additional niceness-of-data assumptions.

Open Problem: Tightness of maximum likelihood semidefinite relaxations

Thu, 29 May 2014 00:00:00 +0000

We have observed an interesting, yet unexplained, phenomenon: Semidefinite programming (SDP) based relaxations of maximum likelihood estimators (MLE) tend to be tight in recovery problems with noisy data, even when MLE cannot exactly recover the ground truth. Several results establish tightness of SDP based relaxations in the regime where exact recovery from MLE is possible. However, to the best of our knowledge, their tightness is not understood beyond this regime. As an illustrative example, we focus on the generalized Procrustes problem.

Preface

Thu, 29 May 2014 00:00:00 +0000

Resourceful Contextual Bandits

Thu, 29 May 2014 00:00:00 +0000

We study contextual bandits with ancillary constraints on resources, which are common in real-world applications such as choosing ads or dynamic pricing of items. We design the first algorithm for solving these problems that improves over a trivial reduction to the non-contextual case. We consider very general settings for both contextual bandits (arbitrary policy sets, Dudik et al. (2011)) and bandits with resource constraints (bandits with knapsacks, Badanidiyuru et al. (2013a)), and prove a regret guarantee with near-optimal statistical properties.

New Algorithms for Learning Incoherent and Overcomplete Dictionaries

Thu, 29 May 2014 00:00:00 +0000

In \em sparse recovery we are given a matrix A ∈\mathbbR^n\times m (“the dictionary”) and a vector of the form A X where X is \em sparse, and the goal is to recover X. This is a central notion in signal processing, statistics and machine learning. But in applications such as \em sparse coding, edge detection, compression and super resolution, the dictionary A is unknown and has to be learned from random examples of the form Y = AX where X is drawn from an appropriate distribution — this is the \em dictionary learning problem. In most settings, A is \em overcomplete: it has more columns than rows. This paper presents a polynomial-time algorithm for learning overcomplete dictionaries; the only previously known algorithm with provable guarantees is the recent work of Spielman et al. (2012) who who gave an algorithm for the undercomplete case, which is rarely the case in applications. Our algorithm applies to \em incoherent dictionaries which have been a central object of study since they were introduced in seminal work of Donoho and Huo (1999). In particular, a dictionary is μ-incoherent if each pair of columns has inner product at most μ/ \sqrtn. The algorithm makes natural stochastic assumptions about the unknown sparse vector X, which can contain k ≤c \min(\sqrtn/μ\log n, m^1/2 - η) non-zero entries (for any η> 0). This is close to the best k allowable by the best sparse recovery algorithms \em even if one knows the dictionary A exactly. Moreover, both the running time and sample complexity depend on \log 1/ε, where εis the target accuracy, and so our algorithms converge very quickly to the true dictionary. Our algorithm can also tolerate substantial amounts of noise provided it is incoherent with respect to the dictionary (e.g., Gaussian). In the noisy setting, our running time and sample complexity depend polynomially on 1/ε, and this is necessary.

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

Thu, 29 May 2014 00:00:00 +0000

In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new “Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the “blessing of dimensionality" in the computational aspects of statistical inference.

Density-preserving quantization with application to graph downsampling

Thu, 29 May 2014 00:00:00 +0000

We consider the problem of vector quantization of i.i.d. samples drawn from a density p on \mathbbR^d. It is desirable that the representatives selected by the quantization algorithm have the same distribution p as the original sample points. However, quantization algorithms based on Euclidean distance, such as k-means, do not have this property. We provide a solution to this problem that takes the unweighted k-nearest neighbor graph on the sample as input. In particular, it does not need to have access to the data points themselves. Our solution generates quantization centers that are “evenly spaced". We exploit this property to downsample geometric graphs and show that our method produces sparse downsampled graphs. Our algorithm is easy to implement, and we provide theoretical guarantees on the performance of the proposed algorithm.

Robust Multi-objective Learning with Mentor Feedback

Thu, 29 May 2014 00:00:00 +0000

We study decision making when each action is described by a set of objectives, all of which are to be maximized. During the training phase, we have access to the actions of an outside agent (“mentor”). In the test phase, our goal is to maximally improve upon the mentor’s (unobserved) actions across all objectives. We present an algorithm with a vanishing regret compared with the optimal possible improvement, and show that our regret bound is the best possible. The bound is independent of the number of actions, and scales only as the logarithm of the number of objectives.

Learning Sparsely Used Overcomplete Dictionaries

Thu, 29 May 2014 00:00:00 +0000

We consider the problem of learning sparsely used overcomplete dictionaries, where each observation is a sparse combination of elements from an unknown overcomplete dictionary. We establish exact recovery when the dictionary elements are mutually incoherent. Our method consists of a clustering-based initialization step, which provides an approximate estimate of the true dictionary with guaranteed accuracy. This estimate is then refined via an iterative algorithm with the following alternating steps: 1) estimation of the dictionary coefficients for each observation through \ell_1 minimization, given the dictionary estimate, and 2) estimation of the dictionary elements through least squares, given the coefficient estimates. We establish that, under a set of sufficient conditions, our method converges at a linear rate to the true dictionary as well as the true coefficients for each observation.

Online Linear Optimization via Smoothing

Thu, 29 May 2014 00:00:00 +0000

We present a new optimization-theoretic approach to analyzing Follow-the-Leader style algorithms, particularly in the setting where perturbations are used as a tool for regularization. We show that adding a strongly convex penalty function to the decision rule and adding stochastic perturbations to data correspond to deterministic and stochastic smoothing operations, respectively. We establish an equivalence between “Follow the Regularized Leader” and “Follow the Perturbed Leader” up to the smoothness properties. This intuition leads to a new generic analysis framework that recovers and improves the previous known regret bounds of the class of algorithms commonly known as Follow the Perturbed Leader.