Proceedings of Machine Learning ResearchProceedings of the 24th Annual Conference on Learning Theory
Held in Budapest, Hungary on 09-11 June 2011
Published as Volume 19 by the Proceedings of Machine Learning Research on 21 December 2011.
Volume Edited by:
Sham M. Kakade
Ulrike von Luxburg
Series Editors:
Neil D. Lawrence
http://proceedings.mlr.press/v19/
Mon, 29 May 2017 07:18:06 +0000Mon, 29 May 2017 07:18:06 +0000Jekyll v3.4.3Identifiability of Priors from Bounded Sample Sizes with Applications to Transfer LearningWe explore a transfer learning setting, in which a finite sequence of target concepts are sampled independently with an unknown distribution from a known family.We study the total number of labeled examples required to learn all targets to an arbitrary specified expected accuracy, focusing on the asymptotics in the number of tasks and the desired accuracy. Our primary interest is formally understanding the fundamental benefits of transfer learning, compared to learning each target independently from the others.Our approach to the transfer problem is general, in the sense that it can be used with a variety of learning protocols. The key insight driving our approach is that the distribution of the target concepts is identifiable from the joint distribution over a number of random labeled data points equal the Vapnik-Chervonenkis dimension of the concept space. This is not necessarily the case for the joint distribution over any smaller number of points.This work has particularly interesting implications when applied to active learning methods.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/yang11a.html
http://proceedings.mlr.press/v19/yang11a.htmlMixability is Bayes Risk Curvature Relative to Log LossMixability of a loss governs the best possible performance when aggregating expert predictions with respect to that loss. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/vanerven11a.html
http://proceedings.mlr.press/v19/vanerven11a.htmlThe Sample Complexity of Dictionary LearningA large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples?We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L_2 error in representation when the dictionary is used.For the case of l_1 regularized coefficient selection we provide a generalization bound of the order of O\left(\sqrtnp\ln(m λ)/m\right), where n is the dimension, p is the number of elements in the dictionary, λis a bound on the l_1 norm of the coefficient vector and m is the number of samples, which complements existing results.For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound ofthe order O(\sqrtnp\ln(m k)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function).We further show that this assumption holds for \em most dictionaries in high dimensions in a strong probabilistic sense.Our results also include bounds that converge as 1/m, not previously known for this problem.We provide similar results in a general setting using kernels with weak smoothness requirements.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/vainsencher11a.html
http://proceedings.mlr.press/v19/vainsencher11a.htmlAgnostic KWIK learning and efficient approximate reinforcement learningA popular approach in reinforcement learning is to use a model-based algorithm, i.e., an algorithm that utilizes a model learner to learn an approximate model to the environment.It has been shown that such a model-based learner is efficient if the model learner is efficient in the so-called “knows what it knows” (KWIK) framework.A major limitation of the standard KWIK framework is that, by its very definition, it covers only the case when the (model) learner can represent the actual environment with no errors.In this paper, we study the agnostic KWIK learning model, where we relax this assumption by allowing nonzero approximation errors. We show that with the new definition an efficient model learner still leads to an efficient reinforcement learning algorithm.At the same time, though, we find that learning within the new framework can be substantially slower as compared to the standard framework, even in the case of simple learning problems.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/szita11a.html
http://proceedings.mlr.press/v19/szita11a.htmlAdaptive Density Level Set ClusteringClusters are often defined to be the connected components of a density level set.Unfortunately, this definition depends on a level that needs to be user specifiedby some means. In this paper we present a simple algorithm that is able to asymptotically determinethe optimal level, that is, the level at which there is the first split in the cluster treeof the data generating distribution. We further show that this algorithm asymptotically recoversthe corresponding connected components. Unlike previous work, our analysis does not require strong assumptions on the density such as continuity or even smoothness.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/steinwart11a.html
http://proceedings.mlr.press/v19/steinwart11a.htmlMonotone multi-armed bandit allocationsWe present a novel angle for multi-armed bandits (henceforth abbreviated MAB) which follows from the recent work on \emphMAB mechanisms \citepMechMAB-ec09,DevanurK08,Transform-ec10. The new problem is, essentially, about designing MAB algorithms under an additional constraint motivated by their application to MAB mechanisms.This note is self-contained, although some familiarity with MAB is assumed; we refer the reader to \citeCesaBL-book for more background.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/slivkins11b.html
http://proceedings.mlr.press/v19/slivkins11b.htmlContextual Bandits with Similarity InformationIn a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms.We consider similarity information in the setting of \emphcontextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the \emphcontext – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a \emphsimilarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs.Prior work on contextual bandits with similarity uses “uniform” partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on “uniform” partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on \em adaptive partitions, and take advantage of “benign” payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g. MAB with constrained temporal change \citepDynamicMAB-colt08 and sleeping bandits \citepsleeping-colt08.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/slivkins11a.html
http://proceedings.mlr.press/v19/slivkins11a.htmlCollaborative Filtering with the Trace Norm: Learning, Bounding, and TransducingTrace-norm regularization is a widely-used and successful approach for collaborative filtering and matrix completion. However, its theoretical understanding is surprisingly weak, and despite previous attempts, there are no distribution-free, non-trivial learning guarantees currently known. In this paper, we bridge this gap by providing such guarantees, under mild assumptions which correspond to collaborative filtering as performed in practice. In fact, we claim that previous difficulties partially stemmed from a mismatch betweenthe standard learning-theoretic modeling of collaborative filtering, and its practical application. Our results also shed some light on the issue of collaborative filtering with bounded models, which enforce predictions to lie within a certain range. In particular, we provide experimental and theoretical evidence that such models lead to a modest yet significant improvement.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/shamir11a.html
http://proceedings.mlr.press/v19/shamir11a.htmlOptimal aggregation of affine estimatorsWe consider the problem of combining a (possibly uncountably infinite) set of affine estimators innon-parametric regression model with heteroscedastic Gaussian noise. Focusing on the exponentially weighted aggregate, we prove a PAC-Bayesian type inequality that leadsto sharp oracle inequalities in discrete but also in continuous settings. The framework is general enough to cover the combinations of various procedures such asleast square regression, kernel ridge regression, shrinking estimators and many other estimators used in the literature on statistical inverse problems. As aconsequence, we show that the proposed aggregateprovides an adaptive estimator in the exact minimax sense without neither discretizing the range of tuningparameters nor splitting the set of observations. We also illustrate numerically the good performance achievedby the exponentially weighted aggregate.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/salmon11a.html
http://proceedings.mlr.press/v19/salmon11a.htmlSequential Event Prediction with Association RulesWe consider a supervised learning problem in which data are revealed sequentially and the goal is to determine what will next be revealed. In the context of this problem, algorithms based on association rules have a distinct advantage over classical statistical and machine learning methods; however, there has not previously been a theoretical foundation established for using association rules in supervised learning. We present two simple algorithms that incorporate association rules, and provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/rudin11a.html
http://proceedings.mlr.press/v19/rudin11a.htmlNeyman-Pearson classification under a strict constraintMotivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i), its probability of type I error is below a pre-specified level and (ii), it has probability of type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type I error. This strict enforcement of the constraint has interesting consequences on the control of the type II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/rigollet11a.html
http://proceedings.mlr.press/v19/rigollet11a.htmlOnline Learning: Beyond RegretWe study online learnability of a wide class of problems, extending the results of \citeRakSriTew10 to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general Φ-regret, learning with non-additive global cost functions, Blackwell’s approachability, calibration of forecasters, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if future is known, and a generalization of sequential Rademacher complexity, studied in \citeRakSriTew10. Since we directly study complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have been previously derived via an algorithmic construction.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/rakhlin11a.html
http://proceedings.mlr.press/v19/rakhlin11a.htmlThe Rate of Convergence of AdaboostThe AdaBoost algorithm of \citetFreundSc97 was designed to combine many “weak” hypotheses that perform slightly better than a random guess into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss” with a fast rate of convergence. Our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Specifically, our first result shows that at iteration t, the exponential loss of AdaBoost’s computed parameter vector will be at most \varepsilon more than that of any parameter vector of \ell_1-norm bounded by B in a number of rounds that is bounded by a polynomial in B and 1/\varepsilon. We also provide rate lower bound examples showing a polynomial dependence on these parameters is necessary. Our second result is that within C/\varepsilon iterations, AdaBoost achieves a value of the exponential loss that is at most \varepsilon more than the best possible value, where C depends on the dataset. We show that this dependence of the rate on \varepsilon is optimal up to constant factors, i.e. at least Ω(1/\varepsilon) rounds are necessary to achieve within \varepsilon of the optimal exponential loss.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/mukherjee11a.html
http://proceedings.mlr.press/v19/mukherjee11a.htmlMissing Information Impediments to LearnabilityTo what extent is learnability impeded when information is missing in learning instances? We present relevant known results and concrete open problems, in the context of a natural extension of the PAC learning model that accounts for arbitrarily missing information.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/michael11a.html
http://proceedings.mlr.press/v19/michael11a.htmlRobust approachability and regret minimization in games with partial monitoringApproachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup.We develop a variant of approachability for games where there is ambiguity in the obtained reward that belongs to a set, rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop simple and efficientalgorithms (i.e., with constant per-step complexity) for this setup.We finally consider external and internal regret in repeated games with partial monitoring, for which we deriveregret-minimizing strategies based on approachability theory.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/mannor11a.html
http://proceedings.mlr.press/v19/mannor11a.htmlA Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler DivergencesWe consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports(not necessarily known beforehand), whose asymptotic regret matches the lower bound of \citetBurnetas96. Our contribution is to provide a finite-time analysis of this algorithm;we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/maillard11a.html
http://proceedings.mlr.press/v19/maillard11a.htmlA New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic DataEfficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the α-th moment with ∆= |1-α|\approx0. To guarantee that the error of estimated Shannon entropy is within a ν-additive factor, the method of \em symmetric stable random projections requires O\left(\frac1ν^2∆^2\right) samples, which is extremely expensive. The first paper \citepProc:Li_SODA09 in \em Compressed Counting (CC), based on \em skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to O\left(\frac1ν^2∆\right), which is still expensive. The followup work \citepProc:Li_UAI09 provides a practical algorithm, which is however difficult to analyze theoretically.In this paper, we propose a new accurate algorithm for Compressed Counting, whose sample complexity is only O\left(\frac1ν^2\right) for ν-additive Shannon entropy estimation. The constant factor for this bound is merely about 6. In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is close to 100% statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/li11a.html
http://proceedings.mlr.press/v19/li11a.htmlMinimax Algorithm for Learning RotationsIt is unknown what is the most suitable regularization for rotation matrices and how to maintain uncertainty overrotations in an online setting.We propose to address these questions by studying the minimax algorithm for rotations and begin by working out the 2-dimensional case.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/kotlowski11b.html
http://proceedings.mlr.press/v19/kotlowski11b.htmlMaximum Likelihood vs. Sequential Normalized Maximum Likelihood in On-line Density EstimationThe paper considers sequential prediction of individual sequences with log loss (online density estimation) using an exponential family of distributions. We first analyze the regret of the maximum likelihood (“follow the leader”) strategy. We find that this strategy is (1) suboptimal and (2) requires an additional assumption about boundedness of the data sequence. We then show that both problems can be be addressed by adding the currently predicted outcome to the calculation of the maximum likelihood, followed by normalization of the distribution. The strategy obtained in this way is known in the literature as the \emphsequential normalized maximum likelihood or \emphlast-step minimax strategy. We show for the first time that for general exponential families, the regret is bounded by the familiar (k/2) \log n and thus optimal up to O(1). We also show the relationship to the Bayes strategy with Jeffreys’ prior.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/kotlowski11a.html
http://proceedings.mlr.press/v19/kotlowski11a.htmlA Close Look to Margin Complexity and Related ParametersConcept classes can canonically be represented by sign-matrices,i.e., by matrices with entries 1 and -1. The question whethera sign-matrix (concept class) A can be learned by a machine that performs large margin classification is closely related to the“margin complexity” associated with A. We consider severalvariants of margin complexity, reveal how they are relatedto each other, and we reveal how they are related to other notions of learning-theoretic relevance like SQ-dimension, CSQ-dimension,and the Forster bound.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/kallweit11a.html
http://proceedings.mlr.press/v19/kallweit11a.htmlPrefacePreface to the Proceedings of the 24th Annual Conference on Learning Theory June 9-11, 2011, Budapest, Hungary.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/kakade11a.html
http://proceedings.mlr.press/v19/kakade11a.htmlA simple multi-armed bandit algorithm with optimal variation-bounded regretWe pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of O(\sqrtQ \log T), where Q is the total quadratic variation of all the arms.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/hazan11b.html
http://proceedings.mlr.press/v19/hazan11b.htmlBeyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimizationWe give a novel algorithm for stochastic strongly-convex optimization in thegradient oracle model which returns an O(\frac1T)-approximate solutionafter T gradient updates. This rate of convergence is optimal in the gradientoracle model. This improves upon the previously known best rate ofO(\frac\log(T)T), which was obtained by applying an onlinestrongly-convex optimization algorithm with regret O(\log(T)) to the batchsetting.We complement this result by proving that any algorithm has expected regret ofΩ(\log(T)) in the online stochastic strongly-convex optimizationsetting. This lower bound holds even in the full-information setting whichreveals more information to the algorithm than just gradients. This shows thatany online-to-batch conversion is inherently suboptimal for stochasticstrongly-convex optimization. This is the first formal evidence that onlineconvex optimization is strictly more difficult than batch stochastic convexoptimization.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/hazan11a.html
http://proceedings.mlr.press/v19/hazan11a.htmlBounds on Individual Risk for Log-loss PredictorsIn sequential prediction with log-loss as well as density estimationwith risk measured by KL divergence, one is often interested in the\em expected instantaneous loss, or, equivalently, the \em individual risk at a given fixed sample size n. For Bayesianprediction and estimation methods, it is often easy to obtain boundson the \em cumulative risk. Such results are based on bounding theindividual sequence regret, a technique that is very well known in theCOLT community. Motivated by the easiness of proofs for the cumulativerisk, our open problem is to use the results on cumulative risk to prove corresponding individual-risk bounds.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/grunwald11b.html
http://proceedings.mlr.press/v19/grunwald11b.htmlSafe Learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexityWe extend Bayesian MAP and Minimum Description Length (MDL) learning by testing whether the data can be substantially more compressed by a mixture of the MDL/MAP distribution with another element of themodel, and adjusting the learning rate if this is the case. While standard Bayes and MDL can fail to converge ifthe model is wrong, the resulting “safe” estimator continues toachieve good rates with wrong models. Moreover, when applied toclassification and regression models as considered in statisticallearning theory, the approach achieves optimal rates under, e.g.,Tsybakov’s conditions, and reveals new situations in which we canpenalize by (- \log \text\sc prior)/n rather than \sqrt(- \log \text\sc prior)/n.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/grunwald11a.html
http://proceedings.mlr.press/v19/grunwald11a.htmlSparsity Regret Bounds for Individual Sequences in Online Linear RegressionWe consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm on i.i.d. data and derive risk bounds of the same flavor as in \citetDaTsy08SEW,DaTsy10MirrorAveraging but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/gerchinovitz11a.html
http://proceedings.mlr.press/v19/gerchinovitz11a.htmlThe KL-UCB Algorithm for Bounded Stochastic Bandits and BeyondThis paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviations results of independent interest which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/garivier11a.html
http://proceedings.mlr.press/v19/garivier11a.htmlOn the Consistency of Multi-Label LearningMulti-label learning has attracted much attention during the past few years. Many multi-label learning approaches have been developed, mostly working with surrogate loss functions since multi-label loss functions are usually difficult to optimize directly owing to non-convexity and discontinuity. Though these approaches are effective, to the best of our knowledge, there is no theoretical result on the convergence of risk of the learned functions to the Bayes risk. In this paper, focusing on two well-known multi-label loss functions, i.e., \textitranking loss and \textithamming loss, we prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our results disclose that, surprisingly, none convex surrogate loss is consistent with the ranking loss. Inspired by the finding, we introduce the \textitpartial ranking loss, with which some surrogate functions are consistent. For hamming loss, we show that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and give a surrogate loss function which is consistent for the deterministic case. Finally, we discuss on the consistency of learning approaches which address multi-label learning by decomposing into a set of binary classification problems.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/gao11a.html
http://proceedings.mlr.press/v19/gao11a.htmlConcentration-Based Guarantees for Low-Rank Matrix ReconstructionWe consider the problem of approximately reconstructing a partially-observed, approximately low-rank matrix. This problem has received much attention lately, mostly using the trace-norm as a surrogate to the rank. Here we study low-rank matrix reconstruction using both the trace-norm, as well as the less-studied max-norm, and present reconstruction guarantees based on existing analysis on the Rademacher complexity of the unit balls of these norms. We show how these are superior in several ways to recently published guarantees based on specialized analysis.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/foygel11a.html
http://proceedings.mlr.press/v19/foygel11a.htmlComplexity-Based Approach to Calibration with Checking RulesWe consider the problem of forecasting a sequence of outcomes from an unknown source. The quality of the forecaster is measured by a family of checking rules. We prove upper bounds on the value of the associated game, thus certifying the existence of a calibrated strategy for the forecaster. We show that complexity of the family of checking rules can be captured by the notion of a sequential cover introduced in \citepRakSriTew10a. Various natural assumptions on the class of checking rules are considered, including finiteness of Vapnik-Chervonenkis and Littlestone’s dimensions.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/foster11a.html
http://proceedings.mlr.press/v19/foster11a.htmlDistribution-Independent Evolvability of Linear Threshold FunctionsValiant’s model of evolvability models the evolutionary process of acquiring useful functionality as a restricted form of learning from random examples \citepValiant:09.Linear threshold functions and their various subclasses, such as conjunctions and decision lists, play a fundamental role in learning theory and hence their evolvabilityhas been the primary focus of research on Valiant’s framework. One of the main open problems regarding the model is whether conjunctions are evolvable distribution-independently\citepFeldmanValiant:08colt. We show that the answer is negative. Our proof is based on a new combinatorial parameter of a concept class that lower-bounds the complexity of learning fromcorrelations.We contrast the lower bound with a proof that linear threshold functions having a non-negligible margin on the data points are evolvable distribution-independently via a simple mutationalgorithm. Our algorithm relies on a non-linear loss function being used to select the hypotheses instead of 0-1 loss in Valiant’s original definition. The proof of evolvabilityrequires that the loss function satisfies several mild conditions that are, for example, satisfied by the quadratic loss function studied in several other works \citepMichael:07,Feldman:09sqd,Valiantp:11manu. An important property of our evolution algorithm is monotonicity, that is the algorithm guaranteesevolvability without any decreases in performance. Previously, monotone evolvability was only shown for conjunctions with quadratic loss \citepFeldman:09sqd or when the distribution on the domain is severely restricted \citepMichael:07,Feldman:09sqd,KanadeVV:10.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/feldman11b.html
http://proceedings.mlr.press/v19/feldman11b.htmlLower Bounds and Hardness Amplification for Learning Shallow Monotone FormulasMuch work has been done on learning various classes of “simple"monotone functions under the uniform distribution.In this paper we give the first unconditional lower bounds for learningproblems of this sort by showing that polynomial-time algorithms cannotlearn shallow monotone Boolean formulas under the uniform distributionin the well-studied Statistical Query (SQ) model.We introduce a new approach to understanding the learnability of “simple” monotone functions that is based ona recent characterization of Strong SQ learnability by \citetSimon-2007Using the characterization we first show that depth-3 monotone formulas of size n^o(1) cannot be learned byany polynomial-time SQ algorithm to accuracy 1 -1/(\log n)^Ω(1). We then build on this result to show thatdepth-4 monotone formulas of size n^o(1) cannot be learned evento a certain \frac 1 2 + o(1) accuracy in polynomial time. Thisimproved hardness is achieved using a general technique that weintroduce for amplifying the hardness of “mildly hard” learningproblems in either the PAC or SQ framework. Thishardness amplification for learning builds on the ideas in the work of\citetODonnell-2002on hardness amplification forapproximating functions using small circuits, and is applicable to a number of other contexts.Finally, we demonstrate that our approach can also be used to reduce the well-known open problem of learning juntas to learning of depth-3 monotone formulas.\ignoreText version:Much work has been done on learning various classes of “simple"monotone functions under the uniform distribution.In this paper we give the first unconditional lower bounds for learningproblems of this sort by showing that polynomial-time algorithms cannotlearn constant-depth monotone Boolean formulas under the uniform distributionin the well-studied Statistical Query model.Using a recent characterization of Strong Statistical Querylearnability due to Feldman (FOCS 2009), we first show thatdepth-3 monotone formulas of size n^o(1) cannot be learned byany polynomial-time Statistical Query algorithm to accuracy 1 -1/(\log n)^Ω(1). We then build on this result to show thatdepth-4 monotone formulas of size n^o(1) cannot be learned evento a certain 1/2 + o(1) accuracy in polynomial time. Thisimproved hardness is achieved using a general technique that weintroduce for amplifying the hardness of “mildly hard” learningproblems in either the PAC or Statistical Query framework. Thishardness amplification for learning builds on the ideas in the work ofO’Donnell (STOC 2002) on hardness amplification forapproximating functions using small circuits, and is applicable to a number of other contexts.Finally, we demonstrate that our technique can also be used to reduce the well-known open problem of learning juntas to learning of depth-3 monotone formulas.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/feldman11a.html
http://proceedings.mlr.press/v19/feldman11a.htmlMulticlass Learnability and the ERM principleMulticlass learning is an area of growing practical relevance, for which thecurrently available theory is still far from providing satisfactoryunderstanding. We study the learnability of multiclass prediction, and deriveupper and lower bounds on the sample complexity of multiclass hypothesisclasses in different learning models: batch/online, realizable/unrealizable,full information/bandit feedback. Our analysis reveals a surprisingphenomenon: In the multiclass setting, in sharp contrast to binaryclassification, not all Empirical Risk Minimization (ERM) algorithms areequally successful. We show that there exist hypotheses classes for which someERM learners have lower sample complexity than others. Furthermore, there areclasses that are learnable by some ERM learners, while other ERM learner willfail to learn them. We propose a principle for designing good ERM learners, anduse this principle to prove tight bounds on the sample complexity of learning\em symmetric multiclass hypothesis classes (that is, classes that areinvariant under any permutation of label names). We demonstrate the relevanceof the theory by analyzing the sample complexity of two widely used hypothesisclasses: generalized linear multiclass models and reduction trees. We also obtainsome practically relevant conclusions.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/daniely11a.html
http://proceedings.mlr.press/v19/daniely11a.htmlTight conditions for consistent variable selection in high dimensional nonparametric regressionWe address the issue of variable selection in the regression model with very high ambient dimension, \textiti.e., when the number of covariates is very large. The main focus is on the situation where the number of relevant covariates, called intrinsic dimension, is much smaller than the ambient dimension. Without assuming any parametric form of the underlying regression function, we get tight conditions making it possible to consistently estimate the set of relevant variables. These conditions relate the intrinsic dimension to the ambient dimension and to the sample size. The procedure that is provably consistent under these tight conditions is simple and is based on comparing the empirical Fourier coefficients with an appropriately chosen threshold value.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/comminges11a.html
http://proceedings.mlr.press/v19/comminges11a.htmlSample Complexity Bounds for Differentially Private LearningThis work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy ofindividuals in the training set.In particular, the learning algorithm is required in this problem toguarantee differential privacy, a very strong notion of privacy that hasgained significant attention in recent years.A natural question to ask is: what is the sample requirement of a learningalgorithm that guarantees a certain level of privacy and accuracy?We address this question in the context of learning with infinite hypothesis classes whenthe data is drawn from a continuous distribution.We first show that even for very simple hypothesis classes, any algorithmthat uses a finite number of examples and guarantees differential privacymust fail to return an accurate classifier for at least some unlabeled datadistributions.This result is unlike the case with either finite hypothesis classes ordiscrete data domains, in which distribution-free private learning ispossible, as previously shown by \citetKLNRS08.We then consider two approaches to differentially private learning that get around this lower bound.The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution \U chosen independently of the sensitive data. Given such a reference \U, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closenessbetween \U and the unlabeled data distribution. Our upper bound appliesto the non-realizable as well as the realizable case. The second approachis to relax the privacy requirement, by requiring only label-privacy –namely, that the only labels (and not the unlabeled parts of the examples)be considered sensitive information. An upper bound on the samplerequirement of learning with label privacy was shown by \citetCDKMT06; inthis work, we show a lower bound.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/chaudhuri11a.html
http://proceedings.mlr.press/v19/chaudhuri11a.htmlMinimax Regret of Finite Partial-Monitoring Games in Stochastic EnvironmentsIn a partial monitoring game, the learner repeatedly chooses an action, theenvironment responds with an outcome, and then the learner suffers a loss andreceives a feedback signal, both of which are fixed functions of the action andthe outcome. The goal of the learner is to minimize his regret, which is thedifference between his total cumulative loss and the total loss of the bestfixed action in hindsight.Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary andunknown probability distribution, we characterize the minimax regret of anypartial monitoring game with finitely many actions andoutcomes. It turns out that the minimax regret of any such game is either zero,\widetildeΘ(\sqrtT), Θ(T^2/3), or Θ(T). We provide a computationally efficient learningalgorithm that achieves the minimax regret within logarithmic factor for any game.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/bartok11a.html
http://proceedings.mlr.press/v19/bartok11a.htmlMinimax Policies for Combinatorial Prediction GamesWe address the online linear optimization problem when theactions of the forecaster are represented by binary vectors.Our goal is to understand the magnitude of the minimax regretfor the worst possible set of actions. We study the problemunder three different assumptions for the feedback: full information, and the partial information models of theso-called “semi-bandit”, and “bandit” problems. We consider both L_∞-, and L_2-type of restrictions forthe losses assigned by the adversary.We formulate a general strategy using Bregman projections on top of a potential-based gradient descent, which generalizes the ones studied in the series of papers \citeGLLO07, DHK08, AHR08, CL09, HW09, KWK10, UNK10, KRS10 and \citeAB10. We provide simpleproofs that recover most of the previous results. We propose new upper bounds for the semi-bandit game. Moreover we derive lower bounds for all three feedback assumptions. With the only exception of the bandit game, the upper and lower boundsare tight, up to a constant factor.Finally, we answer a question asked by \citeKWK10 by showing that the exponentially weighted average forecaster is suboptimal against L_∞ adversaries.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/audibert11a.html
http://proceedings.mlr.press/v19/audibert11a.htmlBandits, Query Learning, and the Haystack DimensionMotivated by multi-armed bandits (MAB) problems with a very large or even infinite number of arms, we consider the problem of finding a maximum of an unknown target function by querying the function at chosen inputs (or arms).We give an analysis of the query complexity of this problem, under the assumption that the payoff of each arm is given by a function belonging to a known, finite, but otherwise arbitrary function class. Our analysis centers on a new notion of function class complexity that we callthe \emphhaystack dimension, which is used to prove the approximate optimality of a simple greedy algorithm. This algorithm is then usedas a subroutine in a \parametric MAB algorithm, yielding provably near-optimal regret. We provide a generalization to the infinite cardinality setting, andcomment on how our analysis is connected to, and improves upon, existing results for query learning and generalized binary search.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/amin11a.html
http://proceedings.mlr.press/v19/amin11a.htmlOracle inequalities for computationally budgeted model selectionWe analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the effects of the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the best function class.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/agarwal11a.html
http://proceedings.mlr.press/v19/agarwal11a.htmlCompetitive Closeness TestingWe test whether two sequences are generated by the same distributionor by two different ones.Unlike previous work, we make no assumptions on the distributions’support size. Additionally, we compare our performance to thatof the best possible test.We describe an efficiently-computable algorithm based on\emphpattern maximum likelihood that is near optimal whenever the bestpossible error probability is \le\exp(-14n^2/3)using length-n sequences.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/acharya11a.html
http://proceedings.mlr.press/v19/acharya11a.htmlBlackwell Approachability and No-Regret Learning are EquivalentWe consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. Blackwell himself previously showed that the theorem implies the existence of a “no-regret” algorithm for a simple online learning problem. We show that this relationship is in fact much stronger, that Blackwell’s result is equivalent to, in a very strong sense, the problem of regret minimization for Online Linear Optimization. We show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide one novel application of this reduction: the first \emphefficient algorithm for calibrated forecasting.Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/abernethy11b.html
http://proceedings.mlr.press/v19/abernethy11b.htmlDoes an Efficient Calibrated Forecasting Strategy Exist?We recall two previously-proposed notions of \emphasymptotic calibration for a forecaster making a sequence of probability predictions. We note that the existence of efficient algorithms for calibrated forecasting holds only in the case of binary outcomes. We pose the question: do there exist such efficient algorithms for the general (non-binary) case?Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/abernethy11a.html
http://proceedings.mlr.press/v19/abernethy11a.htmlRegret Bounds for the Adaptive Control of Linear Quadratic SystemsWe study the average cost Linear Quadratic (LQ) control problem with unknown model parameters, also known as the adaptive control problem in the control community. We design an algorithm and prove that apart from logarithmic factors its regret up to time T is O(\sqrtT).Unlike previous approaches that use a forced-exploration scheme, we construct a high-probability confidence set around the model parameters and design an algorithm that plays optimistically with respect to this confidence set. The construction of the confidence set is based on the recent results from online least-squares estimation and leads to improved worst-case regret bound for the proposed algorithm.To the best of our knowledge this is the the first time that a regret bound is derived for the LQ control problem. %(That O(\sqrtT) is a minimax optimal rate follows from the existing lower bounds for linear bandit problems.)Wed, 21 Dec 2011 00:00:00 +0000
http://proceedings.mlr.press/v19/abbasi-yadkori11a.html
http://proceedings.mlr.press/v19/abbasi-yadkori11a.html