Proceedings of Machine Learning Research
Proceedings of The 28th Conference on Learning Theory
Held in Paris, France on 03-06 July 2015
Published as Volume 40 by the Proceedings of Machine Learning Research on 26 June 2015.
Volume Edited by:
Peter Grünwald
Elad Hazan
Satyen Kale
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v40/
Mon, 29 May 2017 07:23:37 +0000
On Convergence of Emphatic Temporal-Difference Learning
We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton, Mahmood, and White (2015) as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. We present in this paper the first convergence proofs for two emphatic algorithms, ETD(λ) and ELSTD(λ). We prove, under general off-policy conditions, the convergence in L^1 for ELSTD(λ) iterates, and the almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory. Our analysis involves new techniques with applications beyond emphatic algorithms, leading, for example, to the first proof that standard TD(λ) also converges under off-policy training for λ sufficiently large.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Yu15.html
Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity
We present a simple, general technique for reducing the sample complexity of matrix and tensor decomposition algorithms applied to distributions. We use the technique to give a polynomial-time algorithm for standard ICA with sample complexity nearly linear in the dimension, thereby improving substantially on previous bounds. The analysis is based on properties of random polynomials, namely the spacings of an ensemble of polynomials. Our technique also applies to other applications of tensor decompositions, including spherical Gaussian mixture models.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Vempala15.html
Regularized Linear Regression: A Precise Analysis of the Estimation Error
Non-smooth regularized convex optimization procedures have emerged as a powerful tool to recover structured signals (sparse, low-rank, etc.) from (possibly compressed) noisy linear measurements. We focus on the problem of linear regression and consider a general class of optimization methods that minimize a loss function measuring the misfit of the model to the observations with an added structure-inducing regularization term. Celebrated instances include the LASSO, Group-LASSO, Least-Absolute Deviations method, etc. We develop a general framework for determining precise prediction performance guarantees (e.g. mean-square-error) of such methods for the case of a Gaussian measurement ensemble. The machinery builds upon Gordon’s Gaussian min-max theorem under additional convexity assumptions that arise in many practical applications. This theorem associates with a primary optimization (PO) problem a simplified auxiliary optimization (AO) problem from which we can tightly infer properties of the original (PO), such as the optimal cost, the norm of the optimal solution, etc. Our theory applies to general loss functions and regularization and provides guidelines on how to optimally tune the regularizer coefficient when certain structural properties (such as sparsity level, rank, etc.) are known.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Thrampoulidis15.html
Convex Risk Minimization and Conditional Probability Estimation
This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, we show the same result holds for empirical risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Telgarsky15.html
Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery
We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is “close” to the correct expectation over the distribution. This question was recently studied by Dwork et al. (2015), who showed how to answer \tilde{Ω}(n^2) queries efficiently, and also by Hardt and Ullman (2014), who showed that answering \tilde{O}(n^3) queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n^2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code (Fiat and Tassa, 2001). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Steinke15.html
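The definition of a statistical query in the abstract above can be illustrated with a minimal sketch. It shows only the naive, non-adaptive baseline (answering with the empirical mean over the n samples), not the mechanisms or hardness constructions of the paper. The function names, the Uniform[0, 1] distribution, and the tolerance are assumptions chosen for illustration.

```python
import random


def answer_statistical_query(samples, predicate):
    """Answer a statistical query with the empirical mean of a 0/1 predicate."""
    return sum(1 for x in samples if predicate(x)) / len(samples)


def is_accurate(answer, true_expectation, tolerance):
    """An answer is accurate if it is within `tolerance` of the true expectation."""
    return abs(answer - true_expectation) <= tolerance


# Hypothetical setup: n samples from Uniform[0, 1] (an assumption, not from the paper).
rng = random.Random(0)
samples = [rng.random() for _ in range(10_000)]

# Query: "what fraction of the distribution lies below 0.25?" (true value: 0.25).
estimate = answer_statistical_query(samples, lambda x: x < 0.25)
print(estimate, is_accurate(estimate, 0.25, tolerance=0.05))
```

The difficulty studied in the paper arises when later predicates are chosen after seeing earlier answers; the sketch fixes the predicate in advance and so does not exhibit that effect.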
Minimax rates for memory-bounded sparse linear regression
We establish a minimax lower bound of Ω(\frac{kd}{Bε}) on the sample size needed to estimate parameters in a k-sparse linear regression of dimension d under memory restrictions to B bits, where ε is the \ell_2 parameter error. When the covariance of the regressors is the identity matrix, we also provide an algorithm that uses \tilde{O}(B+k) bits and requires \tilde{O}(\frac{kd}{Bε^2}) observations to achieve error ε. Our lower bound also holds in the more general communication-bounded setting, where instead of a memory bound, at most B bits of information are allowed to be (adaptively) communicated about each sample.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Steinhardt15.html
Open Problem: Recursive Teaching Dimension Versus VC Dimension
The Recursive Teaching Dimension (RTD) of a concept class \mathcal{C} is a complexity parameter referring to the worst-case number of labelled examples needed to learn any target concept in \mathcal{C} from a teacher following the recursive teaching model. It is the first teaching complexity notion for which interesting relationships to the VC dimension (VCD) have been established. In particular, for finite maximum classes of a given VCD d, the RTD equals d. To date, there is no concept class known for which the ratio of RTD over VCD exceeds 3/2. However, the only known upper bound on RTD in terms of VCD is exponential in the VCD and depends on the size of the concept class. We pose the following question: is the RTD upper-bounded by a function that grows only linearly in the VCD? Answering this question would further our understanding of the relationships between the complexity of teaching and the complexity of learning from randomly chosen examples. In addition, the answer to this question, whether positive or negative, is known to have implications on the study of the long-standing open sample compression conjecture, which claims that every concept class of VCD d has a sample compression scheme in which samples for concepts in the class are compressed to subsets of size no larger than d.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Simon15b.html
An Almost Optimal PAC Algorithm
The best currently known general lower and upper bounds on the number of labeled examples needed for learning a concept class in the PAC framework (the realizable case) do not perfectly match: they leave a gap of order \log(1/ε) (resp. a gap which is logarithmic in another one of the relevant parameters). It is an unresolved question whether there exists an “optimal PAC algorithm” which establishes a general upper bound with precisely the same order of magnitude as the general lower bound. According to a result of Auer and Ortner, there is no way to show that arbitrary consistent algorithms are optimal because they can provably differ from optimality by a factor of \log(1/ε). In contrast to this result, we show that every consistent algorithm L (even a provably suboptimal one) induces a family (L_K)_{K\ge 1} of PAC algorithms (with 2K-1 calls of L as a subroutine) which come very close to optimality: the number of labeled examples needed by L_K exceeds the general lower bound only by a factor of \ell_K(1/ε), where \ell_K denotes (a truncated version of) the K-times iterated logarithm. Moreover, L_K is applicable to any concept class C of finite VC-dimension and it can be implemented efficiently whenever the consistency problem for C is feasible. We show furthermore that, for every consistent algorithm L, L_2 is an optimal PAC algorithm for precisely the same concept classes which were used by Auer and Ortner for showing the existence of suboptimal consistent algorithms. This can be seen as an indication that L_K may have an even better performance than is suggested by our worst-case analysis.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Simon15a.html
On the Complexity of Bandit Linear Optimization
We study the attainable regret for online linear optimization problems with bandit feedback, where unlike the full-information setting, the player can only observe its own loss rather than the full loss vector. We show that the price of bandit information in this setting can be as large as d, disproving the well-known conjecture (Dani et al., 2007) that the regret for bandit linear optimization is at most \sqrt{d} times the full-information regret. Surprisingly, this is shown using “trivial” modifications of standard domains, which have no effect in the full-information setting. This and other results we present highlight some interesting differences between full-information and bandit learning, which were not considered in previous literature.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Shamir15.html
Generalized Mixability via Entropic Duality
Mixability is a property of a loss which characterizes when constant regret is possible in the game of prediction with expert advice. We show that a key property of mixability generalizes, and the \exp and \log operations present in the usual theory are not as special as one might have thought. In doing so we introduce a more general notion of Φ-mixability where Φ is a general entropy (i.e., any convex function on probabilities). We show how a property shared by the convex dual of any such entropy yields a natural algorithm (the minimizer of a regret bound) which, analogous to the classical Aggregating Algorithm, is guaranteed a constant regret when used with Φ-mixable losses. We characterize which Φ have non-trivial Φ-mixable losses and relate Φ-mixability and its associated Aggregating Algorithm to potential-based methods, a Blackwell-like condition, mirror descent, and risk measures from finance. We also define a notion of “dominance” between different entropies in terms of bounds they guarantee and conjecture that classical mixability gives optimal bounds, for which we provide some supporting empirical evidence.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Reid15.html
Fast Mixing for Discrete Point Processes
We investigate the systematic mechanism for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes under the Dobrushin uniqueness condition for Gibbs measures. Discrete point processes are defined as probability distributions μ(S) ∝ \exp(βf(S)) over all subsets S ∈ 2^V of a finite set V through a bounded set function f: 2^V → \mathbb{R} and a parameter β > 0. A subclass of discrete point processes characterized by submodular functions (which include log-submodular distributions, submodular point processes, and determinantal point processes) has recently gained a lot of interest in machine learning and has been shown to be effective for modeling diversity and coverage. We show that if the set function (not necessarily submodular) displays a natural notion of decay of correlation, then, for β small enough, it is possible to design fast mixing Markov chain Monte Carlo methods that yield error bounds on marginal approximations that do not depend on the size of the set V. The sufficient conditions that we derive involve a control on the (discrete) Hessian of set functions, a quantity that has not been previously considered in the literature. We specialize our results for submodular functions, and we discuss canonical examples where the Hessian can be easily controlled.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Rebeschini15.html
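The definition μ(S) ∝ exp(βf(S)) in the abstract above can be made concrete with a small sketch. It computes the distribution exactly by enumerating all 2^|V| subsets, which is only feasible for a tiny ground set and is not the Markov chain Monte Carlo approach the paper analyzes. The coverage function, the ground set, and all names are hypothetical choices for illustration.

```python
import math
from itertools import combinations


def all_subsets(ground_set):
    """Yield every subset of the ground set as a frozenset."""
    items = list(ground_set)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def point_process(ground_set, f, beta):
    """Exact distribution mu(S) proportional to exp(beta * f(S)), by brute force."""
    weights = {S: math.exp(beta * f(S)) for S in all_subsets(ground_set)}
    normalizer = sum(weights.values())
    return {S: w / normalizer for S, w in weights.items()}


def marginal(mu, element):
    """Probability that `element` belongs to the sampled subset."""
    return sum(p for S, p in mu.items() if element in S)


# Hypothetical coverage function: f(S) counts the points covered by the chosen items.
areas = {"a": {1, 2}, "b": {2, 3}, "c": {4}}


def coverage(S):
    covered = set()
    for item in S:
        covered |= areas[item]
    return len(covered)


mu = point_process(areas.keys(), coverage, beta=0.5)
for item in areas:
    print(item, round(marginal(mu, item), 4))
```

The marginals printed here are the quantities that the paper's error bounds concern; for larger V the enumeration is intractable, which is what motivates sampling.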
Hierarchies of Relaxations for Online Prediction Problems with Evolving Constraints
We study online prediction where regret of the algorithm is measured against a benchmark defined via evolving constraints. This framework captures online prediction on graphs, as well as other prediction problems with combinatorial structure. A key aspect here is that finding the optimal benchmark predictor (even in hindsight, given all the data) might be computationally hard due to the combinatorial nature of the constraints. Despite this, we provide polynomial-time prediction algorithms that achieve low regret against combinatorial benchmark sets. We do so by building improper learning algorithms based on two ideas that work together. The first is to alleviate part of the computational burden through random playout, and the second is to employ Lasserre semidefinite hierarchies to approximate the resulting integer program. Interestingly, for our prediction algorithms, we only need to compute the values of the semidefinite programs and not the rounded solutions. However, the integrality gap for Lasserre hierarchy does enter the generic regret bound in terms of Rademacher complexity of the benchmark set. This establishes a trade-off between the computation time and the regret bound of the algorithm.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Rakhlin15.html
Batched Bandit Problems
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches already gives close to minimax optimal regret bounds, and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Perchet15.html
Partitioning Well-Clustered Graphs: Spectral Clustering Works!
In this work we study the widely used spectral clustering algorithms, i.e., algorithms that partition a graph into k clusters via (1) embedding the vertices of a graph into a low-dimensional space using the bottom eigenvectors of the Laplacian matrix, and (2) partitioning embedded points via k-means algorithms. We show that, for a wide class of well-clustered graphs, spectral clustering algorithms can give a good approximation of the optimal clustering. To the best of our knowledge, it is the first theoretical analysis of spectral clustering algorithms for a wide family of graphs, even though such an approach was proposed in the early 1990s and has comprehensive applications. We also give a nearly-linear time algorithm for partitioning well-clustered graphs, which is based on heat kernel embeddings and approximate nearest neighbor data structures.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Peng15.html
Cortical Learning via Prediction
What is the mechanism of learning in the brain? Despite breathtaking advances in neuroscience, and in machine learning, we do not seem close to an answer. Using Valiant’s neuronal model as a foundation, we introduce PJOIN (for “predictive join”), a primitive that combines association and prediction. We show that PJOIN can be implemented naturally in Valiant’s conservative, formal model of cortical computation. Using PJOIN, and almost nothing else, we give a simple algorithm for unsupervised learning of arbitrary ensembles of binary patterns (solving an open problem in Valiant’s work). This algorithm relies crucially on prediction, and entails significant downward traffic (“feedback”) while parsing stimuli. Prediction and feedback are well-known features of neural cognition and, as far as we know, this is the first theoretical prediction of their essential role in learning.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Papadimitriou15.html
Norm-Based Capacity Control in Neural Networks
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Neyshabur15.html
First-order regret bounds for combinatorial semi-bandits
We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as \widetilde{O}(\sqrt{T}) with the number of rounds T. In this paper, we propose an algorithm that improves this scaling to \widetilde{O}(\sqrt{L_T^*}), where L_T^* is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Neu15.html
Online Density Estimation of Bradley-Terry Models
We consider an online density estimation problem for the Bradley-Terry model, where each model parameter defines the probability of a match result between any pair in a set of n teams. The problem is hard because the loss function (i.e., the negative log-likelihood function in our problem setting) is not convex. To avoid the non-convexity, we can change parameters so that the loss function becomes convex with respect to the new parameter. But then the radius K of the reparameterized domain may be infinite, where K depends on the outcome sequence. We therefore make a mild assumption that guarantees that K is finite. We can thus employ standard online convex optimization algorithms, namely OGD and ONS, over the reparameterized domain, and get regret bounds O(n^{\frac{1}{2}}(\ln K)\sqrt{T}) and O(n^{\frac{3}{2}}K\ln T), respectively, where T is the horizon of the game. The bounds roughly mean that OGD is better when K is large while ONS is better when K is small. But how large can K be? We show that K can be as large as Θ(T^{n-1}), which implies that the worst case regret bounds of OGD and ONS are O(n^{\frac{3}{2}}\sqrt{T}\ln T) and \tilde{O}(n^{\frac{3}{2}}T^{n-1}), respectively. We then propose a version of Follow the Regularized Leader, whose regret bound is close to the minimum of those of OGD and ONS. In other words, our algorithm is competitive with both for a wide range of values of K. In particular, our algorithm achieves the worst case regret bound O(n^{\frac{5}{2}}T^{\frac{1}{3}} \ln T), which is slightly better than OGD with respect to T. In addition, our algorithm works without knowledge of K, which is a practical advantage.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Matsumoto15.html
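The reparameterization idea in the abstract above can be sketched for a single match. The abstract does not state which change of parameters the authors use; the sketch assumes the common log-strength parameterization θ_i = ln w_i, under which the Bradley-Terry win probability w_i / (w_i + w_j) yields a log loss that depends only on θ_j − θ_i and is convex in it. All function names are illustrative.

```python
import math


def win_probability(w_i, w_j):
    """Bradley-Terry probability that team i beats team j, given positive strengths."""
    return w_i / (w_i + w_j)


def log_loss_strengths(w_i, w_j):
    """Negative log-likelihood of 'i beats j' in the original strength parameters."""
    return -math.log(win_probability(w_i, w_j))


def log_loss_reparameterized(theta_i, theta_j):
    """Same loss with theta = ln(w) (assumed reparameterization):
    -ln(e^ti / (e^ti + e^tj)) = ln(1 + e^(tj - ti)), convex in (ti, tj)."""
    return math.log1p(math.exp(theta_j - theta_i))


# Both parameterizations give the same loss for strengths 3 and 1.
print(log_loss_strengths(3.0, 1.0))
print(log_loss_reparameterized(math.log(3.0), math.log(1.0)))
```

Under this assumed parameterization the domain of θ is unbounded, which is consistent with the abstract's remark that the radius K of the reparameterized domain may be infinite without a further assumption.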
Correlation Clustering with Noisy Partial Information
In this paper, we propose and study a semi-random model for the Correlation Clustering problem on arbitrary graphs G. We give two approximation algorithms for Correlation Clustering instances from this model. The first algorithm finds a solution of value (1+δ) \mathrm{opt-cost} + O_δ(n\log^3 n) with high probability, where \mathrm{opt-cost} is the value of the optimal solution (for every δ > 0). The second algorithm finds the ground truth clustering with an arbitrarily small classification error η (under some additional assumptions on the instance).
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Makarychev15.html
Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization
In this paper we derive high probability lower and upper bounds on the excess risk of stochastic optimization of exponentially concave loss functions. Exponentially concave loss functions encompass several fundamental problems in machine learning such as squared loss in linear regression, logistic loss in classification, and negative logarithm loss in portfolio management. We demonstrate an O(d \log T/T) upper bound on the excess risk of the stochastic online Newton step algorithm, and an O(d/T) lower bound on the excess risk of any stochastic optimization method for squared loss, indicating that the obtained upper bound is optimal up to a logarithmic factor. The analysis of the upper bound is based on recent advances in concentration inequalities for bounding self-normalized martingales, which is interesting in its own right, and the proof technique used to achieve the lower bound is a probabilistic method and relies on an information-theoretic minimax analysis.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Mahdavi15.html
Achieving All with No Parameters: AdaNormalHedge
We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm (Luo and Schapire, 2014), called AdaNormalHedge. On one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by Chaudhuri et al. (2009) and Chernov and Vovk (2010). Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve previous work from different aspects, and a special case of the first application resolves another open problem proposed by Warmuth and Koolen (2014) on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Luo15.html
Learning with Square Loss: Localization through Offset Rademacher Complexity
We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of \citep{RakSriTsy15} for the bounded case while also providing guarantees without the boundedness assumption.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Liang15.html
Bad Universal Priors and Notions of Optimality
A big open question of algorithmic information theory is the choice of the universal Turing machine (UTM). For Kolmogorov complexity and Solomonoff induction we have invariance theorems: the choice of the UTM changes bounds only by a constant. For the universally intelligent agent AIXI (Hutter, 2005) no invariance theorem is known. Our results are entirely negative: we discuss cases in which unlucky or adversarial choices of the UTM cause AIXI to misbehave drastically. We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Leike15.html
Low Rank Matrix Completion with Exponential Family Noise
The matrix completion problem consists in reconstructing a matrix from a sample of entries, possibly observed with noise. A popular class of estimators, known as nuclear norm penalized estimators, is based on minimizing the sum of a data fitting term and a nuclear norm penalization. Here, we investigate the case where the noise distribution belongs to the exponential family and is sub-exponential. Our framework allows for a general sampling scheme. We first consider an estimator defined as the minimizer of the sum of a log-likelihood term and a nuclear norm penalization and prove an upper bound on the Frobenius prediction risk. The rate obtained improves on previous works on matrix completion for the exponential family. When the sampling distribution is known, we propose another estimator and prove an oracle inequality w.r.t. the Kullback-Leibler prediction risk, which translates immediately into an upper bound on the Frobenius prediction risk. Finally, we show that all the rates obtained are minimax optimal up to a logarithmic factor.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Lafond15.html
Algorithms for Lipschitz Learning on Graphs
We develop fast algorithms for solving regression problems on graphs where one is given the value of a function at some vertices, and must find its smoothest possible extension to all vertices. The extension we compute is the absolutely minimal Lipschitz extension, and is the limit for large p of p-Laplacian regularization. We present an algorithm that computes a minimal Lipschitz extension in expected linear time, and an algorithm that computes an absolutely minimal Lipschitz extension in expected time \widetilde{O}(mn). The latter algorithm has variants that seem to run much faster in practice. These extensions are particularly amenable to regularization: we can perform l_0-regularization on the given values in polynomial time and l_1-regularization on the initial function values and on graph edge weights in time \widetilde{O}(m^{3/2}). Our definitions and algorithms naturally extend to directed graphs.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kyng15.html
Open Problem: Learning Quantum Circuits with Queries
We pose an open problem on the complexity of learning the behavior of a quantum circuit with value injection queries. We define the learning model for quantum circuits and give preliminary results. Using the test-path lemma of Angluin et al. (2009a), we show that new ideas are likely needed to tackle value injection queries for the quantum setting.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kun15.html
Hierarchical Label Queries with Data-Dependent Partitions
Given a joint distribution P_{X,Y} over a space \mathcal{X} and a label set \mathcal{Y} = \{0, 1\}, we consider the problem of recovering the labels of an unlabeled sample with as few label queries as possible. The recovered labels can be passed to a passive learner, thus turning the procedure into an active learning approach. We analyze a family of labeling procedures based on a hierarchical clustering of the data. While such labeling procedures have been studied in the past, we provide a new parametrization of P_{X,Y} that captures their behavior in general low-noise settings, and which accounts for data-dependent clustering, thus providing new theoretical underpinning to practically used tools.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kpotufe15.html
Open Problem: Online Sabotaged Shortest Path
There has been much work on extending the prediction with expert advice methodology to the case when experts are composed of components and there are combinatorially many such experts. One of the core examples is the Online Shortest Path problem where the components are edges and the experts are paths. In this note we revisit this online routing problem in the case where in each trial some of the edges or components are sabotaged / blocked. In the vanilla expert setting a known method can solve this extension where experts are now awake or asleep in each trial. We ask whether this technology can be upgraded efficiently to the case when at each trial every component can be awake or asleep. It is easy to get an initial regret bound by using combinatorially many experts. However, it is open whether there are efficient algorithms achieving the same regret.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Koolen15b.html
Second-order Quantile Methods for Experts and Combinatorial Games
We aim to design strategies for sequential decision making that adjust to the difficulty of the learning problem. We study this question both in the setting of prediction with expert advice, and for more general combinatorial decision tasks. We are not satisfied with just guaranteeing minimax regret rates, but we want our algorithms to perform significantly better on easy data. Two popular ways to formalize such adaptivity are second-order regret bounds and quantile bounds. The underlying notions of ‘easy data’, which may be paraphrased as “the learning problem has small variance” and “multiple decisions are useful”, are synergetic. But even though there are sophisticated algorithms that exploit one of the two, no existing algorithm is able to adapt to both. The difficulty in combining the two notions lies in tuning a parameter called the learning rate, whose optimal value behaves non-monotonically. We introduce a potential function for which (very surprisingly!) it is sufficient to simply put a prior on learning rates; an approach that does not work for any previous method. By choosing the right prior we construct efficient algorithms and show that they reap both benefits by proving the first bounds that are both second-order and incorporate quantiles.
Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Koolen15a.html
http://proceedings.mlr.press/v40/Koolen15a.htmlRegret Lower Bound and Optimal Algorithm in Dueling Bandit ProblemWe study the K-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight asymptotic regret lower bound that is based on the information divergence. An algorithm that is inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one with a regret upper bound that matches the lower bound. Experimental comparisons of dueling bandit algorithms show that the proposed algorithm significantly outperforms existing ones.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Komiyama15.html
http://proceedings.mlr.press/v40/Komiyama15.htmlOnline PCA with Spectral BoundsThis paper revisits the online PCA problem. Given a stream of $n$ vectors $x_t \in \mathbb{R}^d$ (columns of $X$), the algorithm must output $y_t \in \mathbb{R}^\ell$ (columns of $Y$) before receiving $x_{t+1}$. The goal of online PCA is to simultaneously minimize the target dimension $\ell$ and the error $\|X - (XY^{+})Y\|^2$. We describe two simple and deterministic algorithms. The first receives a parameter $\Delta$ and guarantees that $\|X - (XY^{+})Y\|^2$ is not significantly larger than $\Delta$. It requires a target dimension of $\ell = O(k/\varepsilon)$ for any $k, \varepsilon$ such that $\Delta \ge \varepsilon\sigma_1^2 + \sigma_{k+1}^2$, with $\sigma_i$ being the $i$-th singular value of $X$. The second receives $k$ and $\varepsilon$ and guarantees that $\|X - (XY^{+})Y\|^2 \le \varepsilon\sigma_1^2 + \sigma_{k+1}^2$. It requires a target dimension of $O(k \log n/\varepsilon^2)$. Different models and algorithms for Online PCA were considered in the past. This is the first that achieves a bound on the spectral norm of the residual matrix. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Karnin15.html
http://proceedings.mlr.press/v40/Karnin15.htmlMCMC LearningThe theory of learning under the uniform distribution is rich and deep, with connections to cryptography, computational complexity, and the analysis of boolean functions to name a few areas. This theory however is very limited due to the fact that the uniform distribution and the corresponding Fourier basis are rarely encountered as a statistical model. A family of distributions that vastly generalizes the uniform distribution on the Boolean cube is that of distributions represented by Markov Random Fields (MRF). Markov Random Fields are one of the main tools for modeling high dimensional data in many areas of statistics and machine learning. In this paper we initiate the investigation of extending central ideas, methods and algorithms from the theory of learning under the uniform distribution to the setup of learning concepts given examples from MRF distributions. In particular, our results establish a novel connection between properties of MCMC sampling of MRFs and learning under the MRF distribution. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kanade15.html
http://proceedings.mlr.press/v40/Kanade15.htmlOn Learning Distributions from their SamplesOne of the most natural and important questions in statistical learning is: how well can a distribution be approximated from its samples. Surprisingly, this question has so far been resolved for only one loss, the KL-divergence and even in this case, the estimator used is ad hoc and not well understood. We study distribution approximations for general loss measures. For \ell_2^2 we determine the best approximation possible, for \ell_1 and χ^2 we derive tight bounds on the best approximation, and when the probabilities are bounded away from zero, we resolve the question for all sufficiently smooth loss measures, thereby providing a coherent understanding of the rate at which distributions can be approximated from their samples.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kamath15.html
http://proceedings.mlr.press/v40/Kamath15.htmlExp-Concavity of Proper Composite LossesThe goal of online prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied and $O(\sqrt{T})$ and $O(\log T)$ regret bounds can be achieved for convex losses and strictly convex losses with bounded first and second derivatives respectively. In special cases like the Aggregating Algorithm with mixable losses and the Weighted Average Algorithm with exp-concave losses, it is possible to achieve $O(1)$ regret bounds. But mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses, we show that it is possible to transform (re-parameterize) any β-mixable binary proper loss into a β-exp-concave composite loss with the same β. In the multi-class case, we propose an approximation approach for this transformation.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Kamalaruban15.html
http://proceedings.mlr.press/v40/Kamalaruban15.htmlFast Exact Matrix Completion with Finite SamplesMatrix completion is the problem of recovering a low rank matrix by observing a small fraction of its entries. A series of recent works (Keshavan, 2012), (Jain et al., 2013) and (Hardt, 2013) have proposed fast non-convex optimization based iterative algorithms to solve this problem. However, the sample complexity in all these results is sub-optimal in its dependence on the rank, condition number and the desired accuracy. In this paper, we present a fast iterative algorithm that solves the matrix completion problem by observing $O(nr^5 \log^3 n)$ entries, which is independent of the condition number and the desired accuracy. The run time of our algorithm is $O(nr^7 \log^3 n \log(1/\varepsilon))$, which is near linear in the dimension of the matrix. To the best of our knowledge, this is the first near linear time algorithm for exact matrix completion with finite sample complexity (i.e. independent of $\varepsilon$). Our algorithm is based on a well known projected gradient descent method, where the projection is onto the (non-convex) set of low rank matrices. There are two key ideas in our result: 1) our argument is based on an $\ell_\infty$ norm potential function (as opposed to the spectral norm) and provides a novel way to obtain perturbation bounds for it; 2) we prove and use a natural extension of the Davis-Kahan theorem to obtain perturbation bounds on the best low rank approximation of matrices with good eigengap. Both of these ideas may be of independent interest. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Jain15.html
http://proceedings.mlr.press/v40/Jain15.htmlTensor principal component analysis via sum-of-square proofsWe study a statistical model for the tensor principal component analysis problem introduced by Montanari and Richard: given an order-3 tensor $\mathbf{T}$ of the form $\mathbf{T} = \tau \cdot v_0^{\otimes 3} + \mathbf{A}$, where $\tau \ge 0$ is a signal-to-noise ratio, $v_0$ is a unit vector, and $\mathbf{A}$ is a random noise tensor, the goal is to recover the planted vector $v_0$. For the case that $\mathbf{A}$ has iid standard Gaussian entries, we give an efficient algorithm to recover $v_0$ whenever $\tau \ge \omega(n^{3/4} \log(n)^{1/4})$, and certify that the recovered vector is close to a maximum likelihood estimator, all with high probability over the random choice of $\mathbf{A}$. The previous best algorithms with provable guarantees required $\tau \ge \Omega(n)$. In the regime $\tau \le o(n)$, natural tensor-unfolding-based spectral relaxations for the underlying optimization problem break down. To go beyond this barrier, we use convex relaxations based on the sum-of-squares method. Our recovery algorithm proceeds by rounding a degree-4 sum-of-squares relaxation of the maximum-likelihood-estimation problem for the statistical model. To complement our algorithmic results, we show that degree-4 sum-of-squares relaxations break down for $\tau \le O(n^{3/4}/\log(n)^{1/4})$, which demonstrates that improving our current guarantees (by more than logarithmic factors) would require new techniques or might even be intractable. Finally, we show how to exploit additional problem structure in order to solve our sum-of-squares relaxations, up to some approximation, very efficiently. Our fastest algorithm runs in nearly-linear time using shifted (matrix) power iteration and has similar guarantees as above. The analysis of this algorithm also confirms a variant of a conjecture of Montanari and Richard about singular vectors of tensor unfoldings.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Hopkins15.html
http://proceedings.mlr.press/v40/Hopkins15.htmlAdaptive Recovery of Signals by Convex OptimizationWe present a theoretical framework for adaptive estimation and prediction of signals of unknown structure in the presence of noise. The framework allows us to address two intertwined challenges: (i) designing optimal statistical estimators; (ii) designing efficient numerical algorithms. In particular, we establish oracle inequalities for the performance of adaptive procedures, which rely upon convex optimization and thus can be efficiently implemented. As an application of the proposed approach, we consider denoising of harmonic oscillations.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Harchaoui15.html
http://proceedings.mlr.press/v40/Harchaoui15.htmlComputational Lower Bounds for Community Detection on Random GraphsThis paper studies the problem of detecting the presence of a small dense community planted in a large Erdős-Rényi random graph $\mathcal{G}(N,q)$, where the edge probability within the community exceeds $q$ by a constant factor. Assuming the hardness of the planted clique detection problem, we show that the computational complexity of detecting the community exhibits the following phase transition phenomenon: As the graph size $N$ grows and the graph becomes sparser according to $q = N^{-\alpha}$, there exists a critical value of $\alpha = \frac{2}{3}$, below which there exists a computationally intensive procedure that can detect far smaller communities than any computationally efficient procedure, and above which a linear-time procedure is statistically optimal. The results also lead to the average-case hardness results for recovering the dense community and approximating the densest $K$-subgraph. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Hajek15.html
http://proceedings.mlr.press/v40/Hajek15.htmlOpen Problem: The Oracle Complexity of Smooth Convex Optimization in Nonstandard SettingsFirst-order convex minimization algorithms are currently the methods of choice for large-scale sparse – and more generally parsimonious – regression models. We pose the question on the limits of performance of black-box oriented methods for convex minimization in non-standard settings, where the regularity of the objective is measured in a norm not necessarily induced by the feasible domain. This question is studied for $\ell_p/\ell_q$-settings, and their matrix analogues (Schatten norms), where we find surprising gaps on lower bounds compared to state of the art methods. We propose a conjecture on the optimal convergence rates for these settings, for which a positive answer would lead to significant improvements on minimization algorithms for parsimonious regression models.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Guzman15.html
http://proceedings.mlr.press/v40/Guzman15.htmlConference on Learning Theory 2015: PrefacePreface to COLT 2015Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Grunwald15.html
http://proceedings.mlr.press/v40/Grunwald15.htmlThompson Sampling for Learning Parameterized Markov Decision ProcessesWe consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems, and derive a frequentist regret bound for priors over general parameter spaces. The result shows that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability. It holds for prior distributions that put significant probability near the true model, without any additional, specific closed-form structure such as conjugate or product-form priors. The constant factor in the logarithmic scaling encodes the information complexity of learning the MDP in terms of the Kullback-Leibler geometry of the parameter space.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Gopalan15.html
http://proceedings.mlr.press/v40/Gopalan15.htmlLearning the dependence structure of rare events: a non-asymptotic studyAssessing the probability of occurrence of extreme events is a crucial issue in various fields like finance, insurance, telecommunication or environmental sciences. In a multivariate framework, the tail dependence is characterized by the so-called stable tail dependence function (STDF). Learning this structure is the keystone of multivariate extremes. Although extensive studies have proved consistency and asymptotic normality for the empirical version of the STDF, non-asymptotic bounds are still missing. The main purpose of this paper is to fill this gap. Taking advantage of adapted VC-type concentration inequalities, upper bounds are derived with expected rate of convergence in $O(k^{-1/2})$. The concentration tools involved in this analysis rely on a more general study of maximal deviations in low probability regions, and thus directly apply to the classification of extreme data. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Goix15.html
http://proceedings.mlr.press/v40/Goix15.htmlEscaping From Saddle Points — Online Stochastic Gradient for Tensor DecompositionWe analyze stochastic gradient descent for optimizing non-convex functions. In many cases for non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates are trapped in saddle points. In this paper we identify the strict saddle property for non-convex problems, which allows for efficient optimization. Using this property we show that from an arbitrary starting point, stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has the strict saddle property. As a result we get the first online algorithm for orthogonal tensor decomposition with a global convergence guarantee.Fri, 26 Jun 2015 00:00:00 +0000
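The idea that gradient noise helps iterates leave a saddle point can be illustrated on a toy problem. The sketch below is not the paper's algorithm or its tensor objective: the function f(x, y) = x^2 + y^4/4 - y^2/2, the step size, the noise level and the name `noisy_gradient_descent` are all assumptions chosen for illustration. The origin is a saddle of this f whose Hessian has a strictly negative eigenvalue, and the local minima sit at (0, ±1).

```python
import random


def grad(x, y):
    # Gradient of the toy objective f(x, y) = x**2 + y**4 / 4 - y**2 / 2
    # (an illustrative assumption, not an objective from the paper).
    return 2.0 * x, y ** 3 - y


def noisy_gradient_descent(x, y, step=0.05, noise=0.1, iters=2000, seed=0):
    """Gradient descent with additive Gaussian noise on each gradient coordinate."""
    rng = random.Random(seed)
    for _ in range(iters):
        gx, gy = grad(x, y)
        # With noise == 0 and a start at the saddle (0, 0) the gradient is
        # exactly zero, so the iterate never moves.
        x -= step * (gx + rng.gauss(0.0, noise))
        y -= step * (gy + rng.gauss(0.0, noise))
    return x, y


if __name__ == "__main__":
    print("without noise:", noisy_gradient_descent(0.0, 0.0, noise=0.0))
    print("with noise:   ", noisy_gradient_descent(0.0, 0.0))
```

Started exactly at the saddle, the noiseless run stays put, while the noisy run drifts along the negative-curvature direction and settles near one of the two minima.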
http://proceedings.mlr.press/v40/Ge15.html
http://proceedings.mlr.press/v40/Ge15.htmlA Chaining Algorithm for Online Nonparametric RegressionWe consider the problem of online nonparametric regression with arbitrary deterministic sequences. Using ideas from the chaining technique, we design an algorithm that achieves a Dudley-type regret bound similar to the one obtained in a non-constructive fashion by Rakhlin and Sridharan (2014). Our regret bound is expressed in terms of the metric entropy in the sup norm, which yields optimal guarantees when the metric and sequential entropies are of the same order of magnitude. In particular our algorithm is the first one that achieves optimal rates for online regression over Hölder balls. In addition we show for this example how to adapt our chaining algorithm to get a reasonable computational efficiency with similar regret guarantees (up to a log factor).Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Gaillard15.html
http://proceedings.mlr.press/v40/Gaillard15.htmlCompeting with the Empirical Risk Minimizer in a Single PassIn many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data – commonly referred to as either the empirical risk minimizer (ERM) or the M-estimator – is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties: (1) The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample. (2) The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors. (3) The algorithm’s performance depends on the initial error at a rate that decreases super-polynomially. (4) The algorithm is easily parallelizable. Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Frostig15.html
http://proceedings.mlr.press/v40/Frostig15.htmlVector-Valued Property ElicitationThe elicitation of a statistic, or property of a distribution, is the task of devising proper scoring rules, equivalently proper losses, which incentivize an agent or algorithm to truthfully estimate the desired property of the underlying probability distribution or data set. Leveraging connections between elicitation and convex analysis, we address the vector-valued property case, which has received little attention in the literature despite its applications to both machine learning and statistics. We first provide a very general characterization of linear and ratio-of-linear properties, the first of which resolves an open problem by unifying and strengthening several previous characterizations in machine learning and statistics. We then ask which vectors of properties admit nonseparable scores, which cannot be expressed as a sum of scores for each coordinate separately, a natural desideratum for machine learning. We show that linear and ratio-of-linear do admit nonseparable scores, and provide evidence for a conjecture that these are the only such properties (up to link functions). Finally, we give a general method for producing identification functions and address an open problem by showing that convex maximal level sets are insufficient for elicitability in general.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Frongillo15.html
http://proceedings.mlr.press/v40/Frongillo15.htmlVariable Selection is HardVariable selection for sparse linear regression is the problem of finding, given an $m \times p$ matrix $B$ and a target vector $\mathbf{y}$, a sparse vector $\mathbf{x}$ such that $B\mathbf{x}$ approximately equals $\mathbf{y}$. Assuming a standard complexity hypothesis, we show that no polynomial-time algorithm can find a $k'$-sparse $\mathbf{x}$ with $\|B\mathbf{x}-\mathbf{y}\|^2 \le h(m,p)$, where $k' = k \cdot 2^{\log^{1-\delta} p}$ and $h(m,p) = p^{C_1} m^{1-C_2}$, where $\delta > 0$, $C_1 > 0$, $C_2 > 0$ are arbitrary. This is true even under the promise that there is an unknown $k$-sparse vector $\mathbf{x}^*$ satisfying $B\mathbf{x}^* = \mathbf{y}$. We prove a similar result for a statistical version of the problem in which the data are corrupted by noise. To the authors’ knowledge, these are the first hardness results for sparse regression that apply when the algorithm simultaneously has $k' > k$ and $h(m,p) > 0$.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Foster15.html
http://proceedings.mlr.press/v40/Foster15.htmlFrom Averaging to Acceleration, There is Only a Step-sizeWe show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for quadratic non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n^2), where n is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Flammarion15.html
http://proceedings.mlr.press/v40/Flammarion15.htmlLearning and inference in the presence of corrupted inputsWe consider a model where given an uncorrupted input an adversary can corrupt it to one out of $m$ corrupted inputs. We model the classification and inference problems as a zero-sum game between a learner, minimizing the expected error, and an adversary, maximizing the expected error. The value of this game is the optimal error rate achievable. For learning using a limited hypothesis class $\mathcal{H}$ over corrupted inputs, we give an efficient algorithm that given an uncorrupted sample returns a hypothesis $h \in \mathcal{H}$ whose error on adversarially corrupted inputs is near optimal. Our algorithm uses as a blackbox an oracle that solves the ERM problem for the hypothesis class $\mathcal{H}$. We provide a generalization bound for our setting, showing that for a sufficiently large sample, the performance on the sample and future unseen corrupted inputs will be similar. This gives an efficient learning algorithm for our adversarial setting, based on an ERM oracle. We also consider an inference related setting of the problem, where given a corrupted input, the learner queries the target function on various uncorrupted inputs and generates a prediction regarding the given corrupted input. There is no limitation on the prediction function the learner may generate, so implicitly the hypothesis class includes all possible hypotheses. In this setting we characterize the optimal learner policy as a minimum vertex cover in a given bipartite graph, and the optimal adversary policy as a maximum matching in the same bipartite graph. We design efficient local algorithms for approximating minimum vertex cover in bipartite graphs, which implies an efficient near optimal algorithm for the learner.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Feige15.html
http://proceedings.mlr.press/v40/Feige15.htmlFaster Algorithms for Testing under Conditional SamplingThere has been considerable recent interest in distribution-tests whose run-time and sample requirements are sublinear in the domain-size $k$. We study two of the most important tests under the conditional-sampling model where each query specifies a subset $S$ of the domain, and the response is a sample drawn from $S$ according to the underlying distribution. For identity testing, which asks whether the underlying distribution equals a specific given distribution or $\varepsilon$-differs from it, we reduce the known time and sample complexities from $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ to $\widetilde{\mathcal{O}}(\varepsilon^{-2})$, thereby matching the information theoretic lower bound. For closeness testing, which asks whether two distributions underlying observed data sets are equal or different, we reduce existing complexity from $\widetilde{\mathcal{O}}(\varepsilon^{-4} \log^5 k)$ to an even sub-logarithmic $\widetilde{\mathcal{O}}(\varepsilon^{-5} \log \log k)$, thus providing a better bound to an open problem in Bertinoro Workshop on Sublinear Algorithms (Fisher, 2014).Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Falahatgar15.html
http://proceedings.mlr.press/v40/Falahatgar15.htmlBeyond Hartigan Consistency: Merge Distortion Metric for Hierarchical ClusteringHierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different than the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations which we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify difference between trees. In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence results on manifolds for topological split tree clustering.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Eldridge15.html
http://proceedings.mlr.press/v40/Eldridge15.htmlContextual Dueling BanditsWe consider the problem of learning to choose actions using contextual information when provided with limited feedback in the form of relative pairwise comparisons. We study this problem in the dueling-bandits framework of Yue et al. (COLT’09), which we extend to incorporate context. Roughly, the learner’s goal is to find the best policy, or way of behaving, in some space of policies, although “best” is not always so clearly defined. Here, we propose a new and natural solution concept, rooted in game theory, called a von Neumann winner, a randomized policy that beats or ties every other policy. We show that this notion overcomes important limitations of existing solutions, particularly the Condorcet winner which has typically been used in the past, but which requires strong and often unrealistic assumptions. We then present three efficient algorithms for online learning in our setting, and for approximating a von Neumann winner from batch-like data. The first of these algorithms achieves particularly low regret, even when data is adversarial, although its time and space requirements are linear in the size of the policy space. The other two algorithms require time and space only logarithmic in the size of the policy space when provided access to an oracle for solving classification problems on the space.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Dudik15.html
http://proceedings.mlr.press/v40/Dudik15.htmlImproved Sum-of-Squares Lower Bounds for Hidden Clique and Hidden Submatrix ProblemsGiven a large data matrix $A \in \mathbb{R}^{n \times n}$, we consider the problem of determining whether its entries are i.i.d. from some known marginal distribution $A_{ij} \sim P_0$, or instead $A$ contains a principal submatrix $A_{\mathsf{Q},\mathsf{Q}}$ whose entries have marginal distribution $A_{ij} \sim P_1 \neq P_0$. As a special case, the hidden (or planted) clique problem is finding a planted clique in an otherwise uniformly random graph. Assuming unbounded computational resources, this hypothesis testing problem is statistically solvable provided $|\mathsf{Q}| \ge C \log n$ for a suitable constant $C$. However, despite substantial effort, no polynomial time algorithm is known that succeeds with high probability when $|\mathsf{Q}| = o(\sqrt{n})$. Recently, Meka and Wigderson (2013) proposed a method to establish lower bounds for the hidden clique problem within the Sum of Squares (SOS) semidefinite hierarchy. Here we consider the degree-4 SOS relaxation, and study the construction of Meka and Wigderson (2013) to prove that SOS fails unless $k \ge C\,n^{1/3}/\log n$. An argument presented in Barak's lecture notes implies that this lower bound cannot be substantially improved unless the witness construction is changed in the proof. Our proof uses the moment method to bound the spectrum of a certain random association scheme, i.e. a symmetric random matrix whose rows and columns are indexed by the edges of an Erdős-Rényi random graph. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Deshpande15.html
http://proceedings.mlr.press/v40/Deshpande15.htmlS2: An Efficient Graph Based Active Learning Algorithm with Application to Nonparametric ClassificationThis paper investigates the problem of active learning for binary label prediction on a graph. We introduce a simple and label-efficient algorithm called S^2 for this task. At each step, S^2 selects the vertex to be labeled based on the structure of the graph and all previously gathered labels. Specifically, S^2 queries for the label of the vertex that bisects the shortest shortest path between any pair of oppositely labeled vertices. We present a theoretical estimate of the number of queries S^2 needs in terms of a novel parametrization of the complexity of binary functions on graphs. We also present experimental results demonstrating the performance of S^2 on both real and synthetic data. While other graph-based active learning algorithms have shown promise in practice, our algorithm is the first with both good performance and theoretical guarantees. Finally, we demonstrate the implications of the S^2 algorithm to the theory of nonparametric active learning. In particular, we show that S^2 achieves near minimax optimal excess risk for an important class of nonparametric classification problems.Fri, 26 Jun 2015 00:00:00 +0000
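The query rule described in the abstract (label the vertex that bisects the shortest of all shortest paths between oppositely labeled vertices) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the adjacency-dict representation, the 0/1 label encoding and the names `shortest_path` and `s2_next_query` are assumptions, and the sketch returns `None` once the closest oppositely labeled pair is adjacent rather than handling such edges and continuing the search.

```python
from collections import deque


def shortest_path(adj, source, targets):
    """BFS from source; return a shortest path to the nearest vertex in targets, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def s2_next_query(adj, labels):
    """Return the midpoint of the shortest path joining a 1-labeled and a 0-labeled vertex.

    adj maps each vertex to an iterable of neighbours; labels maps the
    already-queried vertices to 0 or 1 (hypothetical encoding).
    """
    positives = {v for v, y in labels.items() if y == 1}
    negatives = {v for v, y in labels.items() if y == 0}
    best = None
    for s in positives:
        path = shortest_path(adj, s, negatives)
        if path is not None and (best is None or len(path) < len(best)):
            best = path
    # No connecting path, or the closest opposite pair is already adjacent:
    # there is no interior vertex left to bisect in this simplified version.
    if best is None or len(best) < 3:
        return None
    return best[len(best) // 2]
```

On a path graph 0-1-2-3-4 with vertex 0 labeled 1 and vertex 4 labeled 0, the sketch asks for vertex 2 first, which halves the interval that can contain the label change.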
http://proceedings.mlr.press/v40/Dasarathy15.html
http://proceedings.mlr.press/v40/Dasarathy15.htmlA PTAS for Agnostically Learning HalfspacesWe present a PTAS for agnostically learning halfspaces w.r.t. the uniform distribution on the $d$-dimensional sphere. Namely, we show that for every $\mu > 0$ there is an algorithm that runs in time $\mathrm{poly}\left(d, \frac{1}{\varepsilon}\right)$, and is guaranteed to return a classifier with error at most $(1+\mu)\,\mathrm{opt} + \varepsilon$, where $\mathrm{opt}$ is the error of the best halfspace classifier. This improves on Awasthi, Balcan and Long (STOC 2014) who showed an algorithm with an (unspecified) constant approximation ratio. Our algorithm combines the classical technique of polynomial regression, together with the new localization technique of Awasthi et al.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Daniely15.html
http://proceedings.mlr.press/v40/Daniely15.htmlTruthful Linear RegressionWe consider the problem of fitting a linear model to data held by individuals who are concerned about their privacy. Incentivizing most players to truthfully report their data to the analyst constrains our design to mechanisms that provide a privacy guarantee to the participants; we use differential privacy to model individuals’ privacy losses. This immediately poses a problem, as differentially private computation of a linear model necessarily produces a biased estimation, and existing approaches to design mechanisms to elicit data from privacy-sensitive individuals do not generalize well to biased estimators. We overcome this challenge through an appropriate design of the computation and payment scheme.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Cummings15.html
http://proceedings.mlr.press/v40/Cummings15.htmlOn-Line Learning Algorithms for Path Experts with Non-Additive LossesWe consider two broad families of non-additive loss functions covering a large number of applications: rational losses and tropical losses. We give new algorithms extending the Follow-the-Perturbed-Leader (FPL) algorithm to both of these families of loss functions and similarly give new algorithms extending the Randomized Weighted Majority (RWM) algorithm to both of these families. We prove that the time complexity of our extensions to rational losses of both FPL and RWM is polynomial and present regret bounds for both. We further show that these algorithms can play a critical role in improving performance in applications such as structured prediction. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Cortes15.html
http://proceedings.mlr.press/v40/Cortes15.htmlOpen Problem: The landscape of the loss surfaces of multilayer networksDeep learning has enjoyed a resurgence of interest in the last few years for such applications as image and speech recognition, or natural language processing. The vast majority of practical applications of deep learning focus on supervised learning, where the supervised loss function is minimized using stochastic gradient descent. The properties of this highly non-convex loss function, such as its landscape and the behavior of critical points (maxima, minima, and saddle points), as well as the reason why large- and small-size networks achieve radically different practical performance, are however very poorly understood. It was only recently shown that new results in spin-glass theory may provide an explanation for these problems by establishing a connection between the loss function of the neural networks and the Hamiltonian of the spherical spin-glass models. The connection between both models relies on a number of possibly unrealistic assumptions, yet the empirical evidence suggests that the connection may hold in practice. The question we pose is whether it is possible to drop some of these assumptions to establish a stronger connection between both models.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Choromanska15.html
http://proceedings.mlr.press/v40/Choromanska15.htmlStochastic Block Model and Community Detection in Sparse Graphs: A spectral algorithm with optimal rate of recoveryIn this paper, we present and analyze a simple and robust spectral algorithm for the stochastic block model with k blocks, for any k fixed. Our algorithm works with graphs having constant edge density, under an optimal condition on the gap between the density inside a block and the density between the blocks. As a by-product, we settle an open question posed by Abbe et al. concerning censor block models.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Chin15.html
http://proceedings.mlr.press/v40/Chin15.htmlEfficient Sampling for Gaussian Graphical Models via Spectral SparsificationMotivated by a sampling problem basic to computational statistical inference, we develop a toolset based on spectral sparsification for a family of fundamental problems involving Gaussian sampling, matrix functionals, and reversible Markov chains. Drawing on the connection between Gaussian graphical models and the recent breakthroughs in spectral graph theory, we give the first nearly linear time algorithm for the following basic matrix problem: Given an n \times n Laplacian matrix \mathbf{M} and a constant -1 ≤ p ≤ 1, provide efficient access to a sparse n \times n linear operator \tilde{\mathbf{C}} such that \mathbf{M}^p ≈ \tilde{\mathbf{C}} \tilde{\mathbf{C}}^⊤, where ≈ denotes spectral similarity. When p is set to -1, this gives the first parallel sampling algorithm that is essentially optimal both in total work and randomness for Gaussian random fields with symmetric diagonally dominant (SDD) precision matrices. It only requires nearly linear work and 2n i.i.d. random univariate Gaussian samples to generate an n-dimensional i.i.d. Gaussian random sample in polylogarithmic depth. The key ingredient of our approach is an integration of spectral sparsification with the multilevel method: Our algorithms are based on factoring \mathbf{M}^p into a product of well-conditioned matrices, then introducing powers and replacing dense matrices with sparse approximations. We give two sparsification methods for this approach that may be of independent interest. The first invokes Maclaurin series on the factors, while the second builds on our new nearly linear time spectral sparsification algorithm for random-walk matrix polynomials. We expect these algorithmic advances will also help to strengthen the connection between machine learning and spectral graph theory, two of the most active fields in understanding large data and networks. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Cheng15.html
http://proceedings.mlr.press/v40/Cheng15.htmlSequential Information Maximization: When is Greedy Near-optimal?Optimal information gathering is a central challenge in machine learning and science in general. A common objective that quantifies the usefulness of observations is Shannon’s mutual information, defined w.r.t. a probabilistic model. Greedily selecting observations that maximize the mutual information is the method of choice in numerous applications, ranging from Bayesian experimental design to automated diagnosis, to active learning in Bayesian models. Despite its importance and widespread use in applications, little is known about the theoretical properties of sequential information maximization, in particular under noisy observations. In this paper, we analyze the widely used greedy policy for this task, and identify problem instances where it provides provably near-maximal utility, even in the challenging setting of persistent noise. Our results depend on a natural separability condition associated with a channel injecting noise into the observations. We also identify examples where this separability parameter is necessary in the bound: if it is too small, then the greedy policy fails to select informative tests.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Chen15b.html
http://proceedings.mlr.press/v40/Chen15b.htmlLearnability of Solutions to Conjunctive Queries: The Full DichotomyThe problem of learning the solution space of an unknown formula has been studied in multiple embodiments in computational learning theory. In this article, we study a family of such learning problems; this family contains, for each relational structure, the problem of learning the solution space of an unknown conjunctive query evaluated on the structure. A progression of results aimed to classify the learnability of each of the problems in this family, and thus far a culmination thereof was a positive learnability result generalizing all previous ones. This article completes the classification program towards which this progression of results strived, by presenting a negative learnability result that complements the mentioned positive learnability result. In order to obtain our negative result, we make use of universal-algebraic concepts, and our result is phrased in terms of the varietal property of non-congruence modularity.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Chen15a.html
http://proceedings.mlr.press/v40/Chen15a.htmlOn the Complexity of Learning with KernelsA well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix. We show that there are kernel learning problems where no such method will lead to non-trivial computational savings. Our results also quantify how the problem difficulty depends on parameters such as the nature of the loss function, the regularization parameter, the norm of the desired predictor, and the kernel matrix rank. Our results also suggest cases where more efficient kernel learning might be possible.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Cesa-Bianchi15.html
http://proceedings.mlr.press/v40/Cesa-Bianchi15.htmlOptimum Statistical Estimation with Strategic Data SourcesWe propose an optimum mechanism for providing monetary incentives to the data sources of a statistical estimator such as linear regression, so that high quality data is provided at low cost, in the sense that the weighted sum of payments and estimation error is minimized. The mechanism applies to a broad range of estimators, including linear and polynomial regression, kernel regression, and, under some additional assumptions, ridge regression. It also generalizes to several objectives, including minimizing estimation error subject to budget constraints. Besides our concrete results for regression problems, we contribute a mechanism design framework through which to design and analyze statistical estimators whose examples are supplied by workers with cost for labeling said examples.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Cai15.html
http://proceedings.mlr.press/v40/Cai15.htmlThe entropic barrier: a simple and optimal universal self-concordant barrierWe prove that the Fenchel dual of the log-Laplace transform of the uniform measure on a convex body in \mathbb{R}^n is a (1+o(1)) n-self-concordant barrier, improving a seminal result of Nesterov and Nemirovski. This gives the first explicit construction of a universal barrier for convex bodies with optimal self-concordance parameter. The proof is based on basic geometry of log-concave distributions, and elementary duality in exponential families. The result also gives a new perspective on the minimax regret for the linear bandit problem.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Bubeck15b.html
http://proceedings.mlr.press/v40/Bubeck15b.htmlBandit Convex Optimization: \sqrt{T} Regret in One DimensionWe analyze the minimax regret of the adversarial bandit convex optimization problem. Focusing on the one-dimensional case, we prove that the minimax regret is \widetilde{Θ}(\sqrt{T}) and partially resolve a decade-old open problem. Our analysis is non-constructive, as we do not present a concrete algorithm that attains this regret rate. Instead, we use minimax duality to reduce the problem to a Bayesian setting, where the convex loss functions are drawn from a worst-case distribution, and then we solve the Bayesian version of the problem with a variant of Thompson Sampling. Our analysis features a novel use of convexity, formalized as a “local-to-global” property of convex functions, that may be of independent interest. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Bubeck15a.html
http://proceedings.mlr.press/v40/Bubeck15a.htmlEscaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex FunctionsWe consider the problem of optimizing an approximately convex function over a bounded convex set in \mathbb{R}^n using only function evaluations. The problem is reduced to sampling from an approximately log-concave distribution using the Hit-and-Run method, which is shown to have the same \mathcal{O}^* complexity as sampling from log-concave distributions. In addition to extending the analysis for log-concave distributions to approximately log-concave distributions, the implementation of the 1-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm is then based on simulated annealing, which does not rely on first-order conditions and is therefore essentially immune to local minima. We then apply the method to different motivating problems. In the context of zeroth order stochastic convex optimization, the proposed method produces an ε-minimizer after \mathcal{O}^*(n^{7.5}ε^{-2}) noisy function evaluations by inducing an \mathcal{O}(ε/n)-approximately log-concave distribution. We also consider in detail the case when the “amount of non-convexity” decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Belloni15.html
http://proceedings.mlr.press/v40/Belloni15.htmlMinimax Fixed-Design Linear RegressionWe consider a linear regression game in which the covariates are known in advance: at each round, the learner predicts a real value, the adversary reveals a label, and the learner incurs a squared error loss. The aim is to minimize the regret with respect to linear predictions. For a variety of constraints on the adversary’s labels, we show that the minimax optimal strategy is linear, with a parameter choice that is reminiscent of ordinary least squares (and as easy to compute). The predictions depend on all covariates, past and future, with a particular weighting assigned to future covariates corresponding to the role that they play in the minimax regret. We study two families of label sequences: box constraints (under a covariate compatibility condition), and a weighted 2-norm constraint that emerges naturally from the analysis. The strategy is adaptive in the sense that it requires no knowledge of the constraint set. We obtain an explicit expression for the minimax regret for these games. For the case of uniform box constraints, we show that, with worst case covariate sequences, the regret is O(d\log T), with no dependence on the scaling of the covariates. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Bartlett15.html
http://proceedings.mlr.press/v40/Bartlett15.htmlOpen Problem: Restricted Eigenvalue Condition for Heavy Tailed DesignsThe restricted eigenvalue (RE) condition characterizes the sample complexity of accurate recovery in the context of high-dimensional estimators such as Lasso and Dantzig selector (Bickel et al., 2009). Recent work has shown that random design matrices drawn from any thin-tailed (sub-Gaussian) distribution satisfy the RE condition with high probability, when the number of samples scales as the square of the Gaussian width of the restricted set (Banerjee et al., 2014; Tropp, 2015). We pose the equivalent question for heavy-tailed distributions: Given a random design matrix drawn from a heavy-tailed distribution satisfying the small-ball property (Mendelson, 2015), does the design matrix satisfy the RE condition with the same order of sample complexity as sub-Gaussian distributions? An answer to the question will guide the design of high-dimensional estimators for heavy-tailed problems.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Banerjee15.html
http://proceedings.mlr.press/v40/Banerjee15.htmlOptimally Combining Classifiers Using Unlabeled DataWe develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifiers can perform significantly better than any single classifier.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Balsubramani15.html
http://proceedings.mlr.press/v40/Balsubramani15.htmlEfficient Representations for Lifelong Learning and AutoencodingIt has been a long-standing goal in machine learning, as well as in AI more generally, to develop life-long learning systems that learn many different tasks over time, and reuse insights from tasks learned, “learning to learn” as they do so. In this work we pose and provide efficient algorithms for several natural theoretical formulations of this goal. Specifically, we consider the problem of learning many different target functions over time, that share certain commonalities that are initially unknown to the learning algorithm. Our aim is to learn new internal representations as the algorithm learns new target functions, that capture this commonality and allow subsequent learning tasks to be solved more efficiently and from less data. We develop efficient algorithms for two very different kinds of commonalities that target functions might share: one based on learning common low-dimensional and unions of low-dimensional subspaces and one based on learning nonlinear Boolean combinations of features. Our algorithms for learning Boolean feature combinations additionally have a dual interpretation, and can be viewed as giving an efficient procedure for constructing near-optimal sparse Boolean autoencoders under a natural “anchor-set” assumption.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Balcan15.html
http://proceedings.mlr.press/v40/Balcan15.htmlEfficient Learning of Linear Separators under Bounded NoiseWe study the learnability of linear separators in \Re^d in the presence of bounded (a.k.a. Massart) noise. This is a realistic generalization of the random classification noise model, where the adversary can flip each example x with probability η(x) ≤ η. We provide the first polynomial time algorithm that can learn linear separators to arbitrarily small excess error in this noise model under the uniform distribution over the unit sphere in \Re^d, for some constant value of η. While widely studied in the statistical learning theory community in the context of getting faster convergence rates, computationally efficient algorithms in this model had remained elusive. Our work provides the first evidence that one can indeed design algorithms achieving arbitrarily small excess error in polynomial time under this realistic noise model and thus opens up a new and exciting line of research. We additionally provide lower bounds showing that popular algorithms such as hinge loss minimization and averaging cannot lead to arbitrarily small excess error under Massart noise, even under the uniform distribution. Our work, instead, makes use of a margin based technique developed in the context of active learning. As a result, our algorithm is also an active learning algorithm with label complexity that is only logarithmic in the desired excess error ε. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Awasthi15b.html
http://proceedings.mlr.press/v40/Awasthi15b.htmlLabel optimal regret bounds for online local learningWe resolve an open question from Christiano (2014b) posed in COLT’14 regarding the optimal dependency of the regret achievable for online local learning on the size of the label set. In this framework, the algorithm is shown a pair of items at each step, chosen from a set of n items. The learner then predicts a label for each item, from a label set of size L and receives a real valued payoff. This is a natural framework which captures many interesting scenarios such as online gambling and online max cut. Christiano (2014a) designed an efficient online learning algorithm for this problem achieving a regret of O(\sqrt{nL^3 T}), where T is the number of rounds. Information theoretically, one can achieve a regret of O(\sqrt{n \log L\, T}). One of the main open questions left in this framework concerns closing the above gap. In this work, we provide a complete answer to the question above via two main results. First, we show, via a tighter analysis, that the semi-definite programming based algorithm of Christiano (2014a) in fact achieves a regret of O(\sqrt{nLT}). Second, we show a matching computational lower bound. Namely, we show that a polynomial time algorithm for online local learning with lower regret would imply a polynomial time algorithm for the planted clique problem which is widely believed to be hard. We prove a similar hardness result under a related conjecture concerning planted dense subgraphs that we put forth. Unlike planted clique, the planted dense subgraph problem does not have any known quasi-polynomial time algorithms. Computational lower bounds for online learning are relatively rare, and we hope that the ideas developed in this work will lead to lower bounds for other online learning scenarios as well.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Awasthi15a.html
http://proceedings.mlr.press/v40/Awasthi15a.htmlSimple, Efficient, and Neural Algorithms for Sparse CodingSparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Arora15.html
http://proceedings.mlr.press/v40/Arora15.htmlLearning Overcomplete Latent Variable Models through Tensor MethodsWe provide guarantees for learning latent variable models, with emphasis on the overcomplete regime, where the dimensionality of the latent space exceeds the observed dimensionality. In particular, we consider multiview mixtures, ICA, and sparse coding models. Our main tool is a new algorithm for tensor decomposition that works in the overcomplete regime. In the semi-supervised setting, we exploit label information to get a rough estimate of the model parameters, and then refine it using the tensor method on unlabeled samples. We establish learning guarantees when the number of components scales as k=o(d^{p/2}), where d is the observed dimension, and p is the order of the observed moment employed in the tensor method (usually p=3,4). In the unsupervised setting, a simple initialization algorithm based on SVD of the tensor slices is proposed, and the guarantees are provided under the stricter condition that k ≤ βd (where the constant β can be larger than 1). For the learning applications, we provide tight sample complexity bounds through novel covering arguments. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Anandkumar15.html
http://proceedings.mlr.press/v40/Anandkumar15.htmlOnline Learning with Feedback Graphs: Beyond BanditsWe study a general class of online learning problems where the feedback is specified by a graph. This class includes online prediction with expert advice and the multi-armed bandit problem, but also several learning problems where the online player does not necessarily observe his own loss. We analyze how the structure of the feedback graph controls the inherent difficulty of the induced T-round learning problem. Specifically, we show that any feedback graph belongs to one of three classes: strongly observable graphs, weakly observable graphs, and unobservable graphs. We prove that the first class induces learning problems with \widetilde{Θ}(α^{1/2} T^{1/2}) minimax regret, where α is the independence number of the underlying graph; the second class induces problems with \widetilde{Θ}(δ^{1/3} T^{2/3}) minimax regret, where δ is the domination number of a certain portion of the graph; and the third class induces problems with linear minimax regret. Our results subsume much of the previous work on learning with feedback graphs and reveal new connections to partial monitoring games. We also show how the regret is affected if the graphs are allowed to vary with time. Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Alon15.html
http://proceedings.mlr.press/v40/Alon15.htmlOn Consistent Surrogate Risk Minimization and Property ElicitationSurrogate risk minimization is a popular framework for supervised learning; property elicitation is a widely studied area in probability forecasting, machine learning, statistics and economics. In this paper, we connect these two themes by showing that calibrated surrogate losses in supervised learning can essentially be viewed as eliciting or estimating certain properties of the underlying conditional label distribution that are sufficient to construct an optimal classifier under the target loss of interest. Our study helps to shed light on the design of convex calibrated surrogates. We also give a new framework for designing convex calibrated surrogates under low-noise conditions by eliciting properties that allow one to construct ‘coarse’ estimates of the underlying distribution.Fri, 26 Jun 2015 00:00:00 +0000
http://proceedings.mlr.press/v40/Agarwal15.html
http://proceedings.mlr.press/v40/Agarwal15.html