Proceedings of Machine Learning Research
Proceedings of the 24th Annual Conference on Learning Theory
Held in Budapest, Hungary on 09-11 June 2011
Published as Volume 19 by the Proceedings of Machine Learning Research on 21 December 2011.
Volume Edited by:
Sham M. Kakade
Ulrike von Luxburg
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/v19/
Identifiability of Priors from Bounded Sample Sizes with Applications to Transfer Learning
We explore a transfer learning setting, in which a finite sequence of target concepts is sampled independently with an unknown distribution from a known family. We study the total number of labeled examples required to learn all targets to an arbitrary specified expected accuracy, focusing on the asymptotics in the number of tasks and the desired accuracy. Our primary interest is formally understanding the fundamental benefits of transfer learning, compared to learning each target independently from the others. Our approach to the transfer problem is general, in the sense that it can be used with a variety of learning protocols. The key insight driving our approach is that the distribution of the target concepts is identifiable from the joint distribution over a number of random labeled data points equal to the Vapnik-Chervonenkis dimension of the concept space. This is not necessarily the case for the joint distribution over any smaller number of points. This work has particularly interesting implications when applied to active learning methods.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/yang11a.html
Mixability is Bayes Risk Curvature Relative to Log Loss
Mixability of a loss governs the best possible performance when aggregating expert predictions with respect to that loss. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/vanerven11a.html
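The binary characterisation mentioned in the abstract above can be illustrated numerically. The sketch below assumes the binary result takes the form "mixability constant = minimum over p of the ratio of the curvature of the log-loss Bayes risk to the curvature of the loss's own Bayes risk"; that reading, the function names, and the grid search are illustrative assumptions, not taken from the paper. It uses the square loss (y - q)^2, whose Bayes risk is p(1 - p), as the worked case.

```python
def log_loss_curvature(p):
    """Negated second derivative of the log-loss Bayes risk -p ln p - (1-p) ln(1-p)."""
    return 1.0 / (p * (1.0 - p))


def square_loss_curvature(p):
    """Negated second derivative of the square-loss Bayes risk p(1-p), which is constant."""
    return 2.0


def curvature_ratio_min(curvature, grid=999):
    """Minimise log-loss curvature / loss curvature over an interior grid of p values.

    Under the assumed characterisation this minimum is the mixability constant
    of the proper binary loss whose Bayes-risk curvature is `curvature`.
    """
    ps = [(i + 1) / (grid + 1) for i in range(grid)]  # 0.001, ..., 0.999
    return min(log_loss_curvature(p) / curvature(p) for p in ps)


eta_square = curvature_ratio_min(square_loss_curvature)
eta_log = curvature_ratio_min(log_loss_curvature)
```

Under these assumptions the minimum for the square loss is attained at p = 1/2, and log loss compared with itself gives a ratio of one everywhere.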
The Sample Complexity of Dictionary Learning
A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected $L_2$ error in representation when the dictionary is used. For the case of $l_1$ regularized coefficient selection we provide a generalization bound of the order of $O\left(\sqrt{np\ln(m \lambda)/m}\right)$, where $n$ is the dimension, $p$ is the number of elements in the dictionary, $\lambda$ is a bound on the $l_1$ norm of the coefficient vector and $m$ is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most $k$ dictionary elements, we provide a bound of the order $O(\sqrt{np\ln(m k)/m})$ under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as $1/m$, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/vainsencher11a.html
Agnostic KWIK learning and efficient approximate reinforcement learning
A popular approach in reinforcement learning is to use a model-based algorithm, i.e., an algorithm that utilizes a model learner to learn an approximate model of the environment. It has been shown that such a model-based learner is efficient if the model learner is efficient in the so-called “knows what it knows” (KWIK) framework. A major limitation of the standard KWIK framework is that, by its very definition, it covers only the case when the (model) learner can represent the actual environment with no errors. In this paper, we study the agnostic KWIK learning model, where we relax this assumption by allowing nonzero approximation errors. We show that with the new definition an efficient model learner still leads to an efficient reinforcement learning algorithm. At the same time, though, we find that learning within the new framework can be substantially slower as compared to the standard framework, even in the case of simple learning problems.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/szita11a.html
Adaptive Density Level Set Clustering
Clusters are often defined to be the connected components of a density level set. Unfortunately, this definition depends on a level that needs to be user specified by some means. In this paper we present a simple algorithm that is able to asymptotically determine the optimal level, that is, the level at which there is the first split in the cluster tree of the data generating distribution. We further show that this algorithm asymptotically recovers the corresponding connected components. Unlike previous work, our analysis does not require strong assumptions on the density such as continuity or even smoothness.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/steinwart11a.html
Monotone multi-armed bandit allocations
We present a novel angle for multi-armed bandits (henceforth abbreviated MAB) which follows from the recent work on MAB mechanisms (Babaioff et al., 2009; Devanur and Kakade, 2009; Babaioff et al., 2010). The new problem is, essentially, about designing MAB algorithms under an additional constraint motivated by their application to MAB mechanisms. This note is self-contained, although some familiarity with MAB is assumed; we refer the reader to Cesa-Bianchi and Lugosi (2006) for more background.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/slivkins11b.html
Contextual Bandits with Similarity Information
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work on contextual bandits with similarity uses “uniform” partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on “uniform” partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of “benign” payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g. MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a).
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/slivkins11a.html
Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing
Trace-norm regularization is a widely-used and successful approach for collaborative filtering and matrix completion. However, its theoretical understanding is surprisingly weak, and despite previous attempts, there are no distribution-free, non-trivial learning guarantees currently known. In this paper, we bridge this gap by providing such guarantees, under mild assumptions which correspond to collaborative filtering as performed in practice. In fact, we claim that previous difficulties partially stemmed from a mismatch between the standard learning-theoretic modeling of collaborative filtering, and its practical application. Our results also shed some light on the issue of collaborative filtering with bounded models, which enforce predictions to lie within a certain range. In particular, we provide experimental and theoretical evidence that such models lead to a modest yet significant improvement.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/shamir11a.html
Optimal aggregation of affine estimators
We consider the problem of combining a (possibly uncountably infinite) set of affine estimators in a non-parametric regression model with heteroscedastic Gaussian noise. Focusing on the exponentially weighted aggregate, we prove a PAC-Bayesian type inequality that leads to sharp oracle inequalities in discrete but also in continuous settings. The framework is general enough to cover the combinations of various procedures such as least square regression, kernel ridge regression, shrinking estimators and many other estimators used in the literature on statistical inverse problems. As a consequence, we show that the proposed aggregate provides an adaptive estimator in the exact minimax sense without either discretizing the range of tuning parameters or splitting the set of observations. We also illustrate numerically the good performance achieved by the exponentially weighted aggregate.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/salmon11a.html
Sequential Event Prediction with Association Rules
We consider a supervised learning problem in which data are revealed sequentially and the goal is to determine what will next be revealed. In the context of this problem, algorithms based on association rules have a distinct advantage over classical statistical and machine learning methods; however, there has not previously been a theoretical foundation established for using association rules in supervised learning. We present two simple algorithms that incorporate association rules, and provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/rudin11a.html
Neyman-Pearson classification under a strict constraint
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its probability of type I error is below a pre-specified level and (ii) its probability of type II error is close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type I error. This strict enforcement of the constraint has interesting consequences on the control of the type II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/rigollet11a.html
Online Learning: Beyond Regret
We study online learnability of a wide class of problems, extending the results of Rakhlin et al. (2010a) to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general $\Phi$-regret, learning with non-additive global cost functions, Blackwell’s approachability, calibration of forecasters, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if the future is known, and a generalization of sequential Rademacher complexity, studied in Rakhlin et al. (2010a). Since we directly study the complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have been previously derived via an algorithmic construction.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/rakhlin11a.html
The Rate of Convergence of Adaboost
The AdaBoost algorithm of Freund and Schapire (1997) was designed to combine many “weak” hypotheses that perform slightly better than a random guess into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss” with a fast rate of convergence. Our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Specifically, our first result shows that at iteration $t$, the exponential loss of AdaBoost’s computed parameter vector will be at most $\varepsilon$ more than that of any parameter vector of $\ell_1$-norm bounded by $B$ in a number of rounds that is bounded by a polynomial in $B$ and $1/\varepsilon$. We also provide rate lower bound examples showing a polynomial dependence on these parameters is necessary. Our second result is that within $C/\varepsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most $\varepsilon$ more than the best possible value, where $C$ depends on the dataset. We show that this dependence of the rate on $\varepsilon$ is optimal up to constant factors, i.e. at least $\Omega(1/\varepsilon)$ rounds are necessary to achieve within $\varepsilon$ of the optimal exponential loss.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/mukherjee11a.html
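The quantity whose convergence the abstract above analyses can be made concrete with a small sketch. It treats AdaBoost as greedy coordinate descent on the exponential loss over a finite pool of weak hypotheses. The matrix `M`, the toy data, and the function names are illustrative assumptions, not taken from the paper. The toy data is separable, so the infimum of the loss is zero and is not attained at any finite parameter vector.

```python
import math


def exp_loss(M, lam):
    """Average exponential loss; M[i][j] = y_i * h_j(x_i) is +1 if hypothesis j is right on example i."""
    return sum(math.exp(-sum(m * l for m, l in zip(row, lam))) for row in M) / len(M)


def adaboost(M, rounds):
    """Run AdaBoost for `rounds` steps and return (parameter vector, loss after each step)."""
    n = len(M[0])
    lam = [0.0] * n
    losses = [exp_loss(M, lam)]
    for _ in range(rounds):
        # Unnormalised example weights induced by the current combined hypothesis.
        w = [math.exp(-sum(m * l for m, l in zip(row, lam))) for row in M]
        total = sum(w)
        # Edge of hypothesis j: weighted correlation with the labels.
        edges = [sum(wi * row[j] for wi, row in zip(w, M)) / total for j in range(n)]
        j = max(range(n), key=lambda k: abs(edges[k]))
        r = edges[j]
        if r == 0.0 or abs(r) >= 1.0:
            break  # no useful direction, or a perfect hypothesis (step size would be infinite)
        lam[j] += 0.5 * math.log((1 + r) / (1 - r))  # exact line search along coordinate j
        losses.append(exp_loss(M, lam))
    return lam, losses


# Toy pool: 4 examples, 3 weak hypotheses, each wrong on exactly one example.
M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1],
     [+1, +1, +1]]
lam, losses = adaboost(M, rounds=20)
```

Each step multiplies the loss by sqrt(1 - r^2), where r is the chosen edge, so the recorded losses never increase.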
Missing Information Impediments to Learnability
To what extent is learnability impeded when information is missing in learning instances? We present relevant known results and concrete open problems, in the context of a natural extension of the PAC learning model that accounts for arbitrarily missing information.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/michael11a.html
Robust approachability and regret minimization in games with partial monitoring
Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward that belongs to a set, rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop simple and efficient algorithms (i.e., with constant per-step complexity) for this setup. We finally consider external and internal regret in repeated games with partial monitoring, for which we derive regret-minimizing strategies based on approachability theory.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/mannor11a.html
A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/maillard11a.html
A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the $\alpha$-th moment with $\Delta = |1-\alpha|\approx 0$. To guarantee that the error of estimated Shannon entropy is within a $\upsilon$-additive factor, the method of symmetric stable random projections requires $O\left(\frac{1}{\upsilon^2\Delta^2}\right)$ samples, which is extremely expensive. The first paper (Li, 2009a) in Compressed Counting (CC), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to $O\left(\frac{1}{\upsilon^2\Delta}\right)$, which is still expensive. The followup work (Li, 2009b) provides a practical algorithm, which is however difficult to analyze theoretically. In this paper, we propose a new accurate algorithm for Compressed Counting, whose sample complexity is only $O\left(\frac{1}{\upsilon^2}\right)$ for $\upsilon$-additive Shannon entropy estimation. The constant factor for this bound is merely about $6$. In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is close to $100\%$ statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/li11a.html
Minimax Algorithm for Learning Rotations
It is unknown what the most suitable regularization for rotation matrices is and how to maintain uncertainty over rotations in an online setting. We propose to address these questions by studying the minimax algorithm for rotations and begin by working out the 2-dimensional case.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/kotlowski11b.html
Maximum Likelihood vs. Sequential Normalized Maximum Likelihood in On-line Density Estimation
The paper considers sequential prediction of individual sequences with log loss (online density estimation) using an exponential family of distributions. We first analyze the regret of the maximum likelihood (“follow the leader”) strategy. We find that this strategy is (1) suboptimal and (2) requires an additional assumption about boundedness of the data sequence. We then show that both problems can be addressed by adding the currently predicted outcome to the calculation of the maximum likelihood, followed by normalization of the distribution. The strategy obtained in this way is known in the literature as the sequential normalized maximum likelihood or last-step minimax strategy. We show for the first time that for general exponential families, the regret is bounded by the familiar $(k/2) \log n$ and thus optimal up to $O(1)$. We also show the relationship to the Bayes strategy with Jeffreys’ prior.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/kotlowski11a.html
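The construction described in the abstract above (add the candidate next outcome to the data, maximise the likelihood, then normalise) can be sketched for the simplest exponential family, the Bernoulli model. The function names are illustrative assumptions. The direct evaluation of powers is only suitable for short sequences, since it underflows for large counts.

```python
def max_likelihood(ones, total):
    """Maximised Bernoulli likelihood of a sequence with `ones` ones among `total` bits."""
    theta = ones / total
    return theta ** ones * (1 - theta) ** (total - ones)  # Python evaluates 0.0 ** 0 as 1.0


def snml_bernoulli(k, n):
    """Sequential normalized maximum likelihood probability that bit n+1 is 1, given k ones in n bits."""
    with_one = max_likelihood(k + 1, n + 1)   # pretend the next bit is 1
    with_zero = max_likelihood(k, n + 1)      # pretend the next bit is 0
    return with_one / (with_one + with_zero)  # normalise over the two candidates


def ml_bernoulli(k, n):
    """Plain maximum likelihood ("follow the leader") prediction; needs n >= 1."""
    return k / n


p_empty = snml_bernoulli(0, 0)      # no data yet
p_after_one = snml_bernoulli(1, 1)  # one observation, which was a 1
```

After a single observed 1, the plain maximum likelihood prediction assigns probability zero to a 0, so its log loss would be infinite if a 0 came next. The normalised prediction stays strictly inside (0, 1).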
A Close Look to Margin Complexity and Related Parameters
Concept classes can canonically be represented by sign-matrices, i.e., by matrices with entries $1$ and $-1$. The question whether a sign-matrix (concept class) $A$ can be learned by a machine that performs large margin classification is closely related to the “margin complexity” associated with $A$. We consider several variants of margin complexity, reveal how they are related to each other, and we reveal how they are related to other notions of learning-theoretic relevance like SQ-dimension, CSQ-dimension, and the Forster bound.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/kallweit11a.html
Preface
Preface to the Proceedings of the 24th Annual Conference on Learning Theory, June 9-11, 2011, Budapest, Hungary.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/kakade11a.html
A simple multi-armed bandit algorithm with optimal variation-bounded regret
We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of $O(\sqrt{Q \log T})$, where $Q$ is the total quadratic variation of all the arms.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/hazan11b.html
Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
We give a novel algorithm for stochastic strongly-convex optimization in the gradient oracle model which returns an $O(\frac1T)$-approximate solution after $T$ gradient updates. This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of $O(\frac{\log(T)}{T})$, which was obtained by applying an online strongly-convex optimization algorithm with regret $O(\log(T))$ to the batch setting. We complement this result by proving that any algorithm has expected regret of $\Omega(\log(T))$ in the online stochastic strongly-convex optimization setting. This lower bound holds even in the full-information setting which reveals more information to the algorithm than just gradients. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly-convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/hazan11a.html
Bounds on Individual Risk for Log-loss Predictors
In sequential prediction with log-loss as well as density estimation with risk measured by KL divergence, one is often interested in the expected instantaneous loss, or, equivalently, the individual risk at a given fixed sample size $n$. For Bayesian prediction and estimation methods, it is often easy to obtain bounds on the cumulative risk. Such results are based on bounding the individual sequence regret, a technique that is very well known in the COLT community. Motivated by the ease of the proofs for the cumulative risk, our open problem is to use the results on cumulative risk to prove corresponding individual-risk bounds.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/grunwald11b.html
Safe Learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity
We extend Bayesian MAP and Minimum Description Length (MDL) learning by testing whether the data can be substantially more compressed by a mixture of the MDL/MAP distribution with another element of the model, and adjusting the learning rate if this is the case. While standard Bayes and MDL can fail to converge if the model is wrong, the resulting “safe” estimator continues to achieve good rates with wrong models. Moreover, when applied to classification and regression models as considered in statistical learning theory, the approach achieves optimal rates under, e.g., Tsybakov’s conditions, and reveals new situations in which we can penalize by $(- \log \mathrm{PRIOR})/n$ rather than $\sqrt{(- \log \mathrm{PRIOR})/n}$.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/grunwald11a.html
Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension $d$ can be much larger than the number of time rounds $T$. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW, which is based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to i.i.d. data and derive risk bounds of the same flavor as in Dalalyan and Tsybakov (2008, 2011) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/gerchinovitz11a.html
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviation results of independent interest which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/garivier11a.html
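The index policy named in the abstract above can be sketched for Bernoulli rewards. The sketch computes, for one arm, the largest mean q that is still plausible given the empirical mean, in the sense that pulls * kl(mean, q) stays within an exploration budget of log t + c log log t. The function names, the bisection solver, the clipping constant, and the default c = 0 are illustrative assumptions, not a definitive implementation.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with both clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def klucb_index(mean, pulls, t, c=0.0, iters=50):
    """Upper confidence index of an arm with empirical mean `mean` after `pulls` pulls at round t >= 2.

    Bisection works because kl_bernoulli(mean, q) is increasing in q for q >= mean.
    """
    budget = math.log(t)
    if c > 0:
        budget += c * math.log(math.log(t))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid  # mid is still plausible, move up
        else:
            hi = mid  # mid is ruled out, move down
    return lo


few_pulls = klucb_index(mean=0.5, pulls=10, t=100)
many_pulls = klucb_index(mean=0.5, pulls=100, t=100)
```

A bandit loop would pull each arm once and then, at every round, pull the arm with the largest index. The index shrinks toward the empirical mean as an arm accumulates pulls, which is what drives exploration of rarely pulled arms.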
On the Consistency of Multi-Label Learning
Multi-label learning has attracted much attention during the past few years. Many multi-label learning approaches have been developed, mostly working with surrogate loss functions since multi-label loss functions are usually difficult to optimize directly owing to non-convexity and discontinuity. Though these approaches are effective, to the best of our knowledge, there is no theoretical result on the convergence of risk of the learned functions to the Bayes risk. In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our results disclose that, surprisingly, no convex surrogate loss is consistent with the ranking loss. Inspired by the finding, we introduce the partial ranking loss, with which some surrogate functions are consistent. For hamming loss, we show that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and give a surrogate loss function which is consistent for the deterministic case. Finally, we discuss the consistency of learning approaches which address multi-label learning by decomposing into a set of binary classification problems.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/gao11a.html
Concentration-Based Guarantees for Low-Rank Matrix Reconstruction
We consider the problem of approximately reconstructing a partially-observed, approximately low-rank matrix. This problem has received much attention lately, mostly using the trace-norm as a surrogate to the rank. Here we study low-rank matrix reconstruction using both the trace-norm, as well as the less-studied max-norm, and present reconstruction guarantees based on existing analysis on the Rademacher complexity of the unit balls of these norms. We show how these are superior in several ways to recently published guarantees based on specialized analysis.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/foygel11a.html
Complexity-Based Approach to Calibration with Checking Rules
We consider the problem of forecasting a sequence of outcomes from an unknown source. The quality of the forecaster is measured by a family of checking rules. We prove upper bounds on the value of the associated game, thus certifying the existence of a calibrated strategy for the forecaster. We show that the complexity of the family of checking rules can be captured by the notion of a sequential cover introduced by Rakhlin et al. (2010a). Various natural assumptions on the class of checking rules are considered, including finiteness of Vapnik-Chervonenkis and Littlestone’s dimensions.
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/foster11a.html
Distribution-Independent Evolvability of Linear Threshold Functions
Valiant’s model of evolvability models the evolutionary process of acquiring useful functionality as a restricted form of learning from random examples (Valiant, 2009). Linear threshold functions and their various subclasses, such as conjunctions and decision lists, play a fundamental role in learning theory and hence their evolvability has been the primary focus of research on Valiant’s framework. One of the main open problems regarding the model is whether conjunctions are evolvable distribution-independently (Feldman and Valiant, 2008). We show that the answer is negative. Our proof is based on a new combinatorial parameter of a concept class that lower-bounds the complexity of learning from correlations. We contrast the lower bound with a proof that linear threshold functions having a non-negligible margin on the data points are evolvable distribution-independently via a simple mutation algorithm. Our algorithm relies on a non-linear loss function being used to select the hypotheses instead of 0-1 loss in Valiant’s original definition. The proof of evolvability requires that the loss function satisfies several mild conditions that are, for example, satisfied by the quadratic loss function studied in several other works (Michael, 2007; Feldman, 2009b; Valiant, 2011). An important property of our evolution algorithm is monotonicity, that is, the algorithm guarantees evolvability without any decreases in performance. Previously, monotone evolvability was only shown for conjunctions with quadratic loss (Feldman, 2009b) or when the distribution on the domain is severely restricted (Michael, 2007; Feldman, 2009b; Kanade et al., 2010).
Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/feldman11b.html
https://proceedings.mlr.press/v19/feldman11b.htmlLower Bounds and Hardness Amplification for Learning Shallow Monotone FormulasMuch work has been done on learning various classes of “simple” monotone functions under the uniform distribution. In this paper we give the first unconditional lower bounds for learning problems of this sort by showing that polynomial-time algorithms cannot learn shallow monotone Boolean formulas under the uniform distribution in the well-studied Statistical Query (SQ) model. We introduce a new approach to understanding the learnability of “simple” monotone functions that is based on a recent characterization of Strong SQ learnability by Simon (2007). Using the characterization we first show that depth-3 monotone formulas of size $n^{o(1)}$ cannot be learned by any polynomial-time SQ algorithm to accuracy $1 - 1/(\log n)^{\Omega(1)}$. We then build on this result to show that depth-4 monotone formulas of size $n^{o(1)}$ cannot be learned even to a certain $\frac 1 2 + o(1)$ accuracy in polynomial time. This improved hardness is achieved using a general technique that we introduce for amplifying the hardness of “mildly hard” learning problems in either the PAC or SQ framework. This hardness amplification for learning builds on the ideas in the work of O’Donnell (2004) on hardness amplification for approximating functions using small circuits, and is applicable to a number of other contexts. Finally, we demonstrate that our approach can also be used to reduce the well-known open problem of learning juntas to learning of depth-3 monotone formulas.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/feldman11a.html
https://proceedings.mlr.press/v19/feldman11a.htmlMulticlass Learnability and the ERM principleMulticlass learning is an area of growing practical relevance, for which the currently available theory is still far from providing satisfactory understanding. We study the learnability of multiclass prediction, and derive upper and lower bounds on the sample complexity of multiclass hypothesis classes in different learning models: batch/online, realizable/unrealizable, full information/bandit feedback. Our analysis reveals a surprising phenomenon: In the multiclass setting, in sharp contrast to binary classification, not all Empirical Risk Minimization (ERM) algorithms are equally successful. We show that there exist hypothesis classes for which some ERM learners have lower sample complexity than others. Furthermore, there are classes that are learnable by some ERM learners, while other ERM learners will fail to learn them. We propose a principle for designing good ERM learners, and use this principle to prove tight bounds on the sample complexity of learning symmetric multiclass hypothesis classes (that is, classes that are invariant under any permutation of label names). We demonstrate the relevance of the theory by analyzing the sample complexity of two widely used hypothesis classes: generalized linear multiclass models and reduction trees. We also obtain some practically relevant conclusions.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/daniely11a.html
https://proceedings.mlr.press/v19/daniely11a.htmlTight conditions for consistent variable selection in high dimensional nonparametric regressionWe address the issue of variable selection in the regression model with very high ambient dimension, i.e., when the number of covariates is very large. The main focus is on the situation where the number of relevant covariates, called intrinsic dimension, is much smaller than the ambient dimension. Without assuming any parametric form of the underlying regression function, we get tight conditions making it possible to consistently estimate the set of relevant variables. These conditions relate the intrinsic dimension to the ambient dimension and to the sample size. The procedure that is provably consistent under these tight conditions is simple and is based on comparing the empirical Fourier coefficients with an appropriately chosen threshold value.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/comminges11a.html
https://proceedings.mlr.press/v19/comminges11a.htmlSample Complexity Bounds for Differentially Private LearningThis work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required in this problem to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution $\mathcal{U}$ chosen independently of the sensitive data. Given such a reference $\mathcal{U}$, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between $\mathcal{U}$ and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/chaudhuri11a.html
https://proceedings.mlr.press/v19/chaudhuri11a.htmlMinimax Regret of Finite Partial-Monitoring Games in Stochastic EnvironmentsIn a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. Assuming that the outcomes are generated in an i.i.d. fashion from an arbitrary and unknown probability distribution, we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, $\widetilde{\Theta}(\sqrt{T})$, $\Theta(T^{2/3})$, or $\Theta(T)$. We provide a computationally efficient learning algorithm that achieves the minimax regret within a logarithmic factor for any game.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/bartok11a.html
https://proceedings.mlr.press/v19/bartok11a.htmlMinimax Policies for Combinatorial Prediction GamesWe address the online linear optimization problem when the actions of the forecaster are represented by binary vectors. Our goal is to understand the magnitude of the minimax regret for the worst possible set of actions. We study the problem under three different assumptions for the feedback: full information, and the partial information models of the so-called “semi-bandit” and “bandit” problems. We consider both $L_\infty$- and $L_2$-type restrictions for the losses assigned by the adversary. We formulate a general strategy using Bregman projections on top of a potential-based gradient descent, which generalizes the ones studied in the series of papers Gyorgy et al. (2007); Dani et al. (2008); Abernethy et al. (2008); Cesa-Bianchi and Lugosi (2009); Helmbold and Warmuth (2009); Koolen et al. (2010); Uchiya et al. (2010); Kale et al. (2010) and Audibert and Bubeck (2010). We provide simple proofs that recover most of the previous results. We propose new upper bounds for the semi-bandit game. Moreover, we derive lower bounds for all three feedback assumptions. With the only exception of the bandit game, the upper and lower bounds are tight, up to a constant factor. Finally, we answer a question asked by Koolen et al. (2010) by showing that the exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/audibert11a.html
https://proceedings.mlr.press/v19/audibert11a.htmlBandits, Query Learning, and the Haystack DimensionMotivated by multi-armed bandit (MAB) problems with a very large or even infinite number of arms, we consider the problem of finding a maximum of an unknown target function by querying the function at chosen inputs (or arms). We give an analysis of the query complexity of this problem, under the assumption that the payoff of each arm is given by a function belonging to a known, finite, but otherwise arbitrary function class. Our analysis centers on a new notion of function class complexity that we call the haystack dimension, which is used to prove the approximate optimality of a simple greedy algorithm. This algorithm is then used as a subroutine in a functional MAB algorithm, yielding provably near-optimal regret. We provide a generalization to the infinite cardinality setting, and comment on how our analysis is connected to, and improves upon, existing results for query learning and generalized binary search.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/amin11a.html
https://proceedings.mlr.press/v19/amin11a.htmlOracle inequalities for computationally budgeted model selectionWe analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the effects of the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the best function class.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/agarwal11a.html
https://proceedings.mlr.press/v19/agarwal11a.htmlCompetitive Closeness TestingWe test whether two sequences are generated by the same distribution or by two different ones. Unlike previous work, we make no assumptions on the distributions’ support size. Additionally, we compare our performance to that of the best possible test. We describe an efficiently-computable algorithm based on pattern maximum likelihood that is near optimal whenever the best possible error probability is $\le\exp(-14n^{2/3})$ using length-$n$ sequences.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/acharya11a.html
https://proceedings.mlr.press/v19/acharya11a.htmlBlackwell Approachability and No-Regret Learning are EquivalentWe consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. Blackwell himself previously showed that the theorem implies the existence of a “no-regret” algorithm for a simple online learning problem. We show that this relationship is in fact much stronger, that Blackwell’s result is equivalent to, in a very strong sense, the problem of regret minimization for Online Linear Optimization. We show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide one novel application of this reduction: the first efficient algorithm for calibrated forecasting.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/abernethy11b.html
https://proceedings.mlr.press/v19/abernethy11b.htmlDoes an Efficient Calibrated Forecasting Strategy Exist?We recall two previously-proposed notions of asymptotic calibration for a forecaster making a sequence of probability predictions. We note that the existence of efficient algorithms for calibrated forecasting holds only in the case of binary outcomes. We pose the question: do there exist such efficient algorithms for the general (non-binary) case?Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/abernethy11a.html
https://proceedings.mlr.press/v19/abernethy11a.htmlRegret Bounds for the Adaptive Control of Linear Quadratic SystemsWe study the average cost Linear Quadratic (LQ) control problem with unknown model parameters, also known as the adaptive control problem in the control community. We design an algorithm and prove that apart from logarithmic factors its regret up to time $T$ is $O(\sqrt{T})$. Unlike previous approaches that use a forced-exploration scheme, we construct a high-probability confidence set around the model parameters and design an algorithm that plays optimistically with respect to this confidence set. The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm. To the best of our knowledge this is the first time that a regret bound is derived for the LQ control problem.Wed, 21 Dec 2011 00:00:00 +0000
https://proceedings.mlr.press/v19/abbasi-yadkori11a.html
https://proceedings.mlr.press/v19/abbasi-yadkori11a.html