Proceedings of Machine Learning Research
Proceedings of the 28th International Conference on Algorithmic Learning Theory
Held at Kyoto University, Kyoto, Japan, on 15-17 October 2017
Published as Volume 76 by the Proceedings of Machine Learning Research on 11 October 2017.
Volume Edited by:
Steve Hanneke
Lev Reyzin
Series Editors:
Neil D. Lawrence
Mark Reid
http://proceedings.mlr.press/v76/
Learning from Networked Examples
Many machine learning algorithms are based on the assumption that training examples are drawn independently. However, this assumption no longer holds when learning from a networked sample, because two or more training examples may share common objects, and hence share the features of these shared objects. We show that the classic approach of ignoring this problem can have a harmful effect on the accuracy of statistics, and then consider alternatives. One of these is to use only independent examples, discarding other information. However, this is clearly suboptimal. We analyze sample error bounds in this networked setting, providing significantly improved results. An important component of our approach is formed by efficient sample weighting schemes, which lead to novel concentration inequalities.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/wang17a.html
A Strongly Quasiconvex PAC-Bayesian Bound
We propose a new PAC-Bayesian bound and a way of constructing a hypothesis space, so that the bound is convex in the posterior distribution and also convex in a trade-off parameter between empirical performance of the posterior distribution and its complexity. The complexity is measured by the Kullback-Leibler divergence to a prior. We derive an alternating procedure for minimizing the bound. We show that the bound can be rewritten as a one-dimensional function of the trade-off parameter and provide sufficient conditions under which the function has a single global minimum. When the conditions are satisfied, the alternating minimization is guaranteed to converge to the global minimum of the bound. We provide experimental results demonstrating that rigorous minimization of the bound is competitive with cross-validation in tuning the trade-off between complexity and empirical performance. In all our experiments the trade-off turned out to be quasiconvex even when the sufficient conditions were violated.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/thiemann17a.html
Hypotheses testing on infinite random graphs
Drawing on some recent results that provide the formalism necessary to define stationarity for infinite random graphs, this paper initiates the study of statistical and learning questions pertaining to these objects. Specifically, a criterion for the existence of a consistent test for complex hypotheses is presented, generalizing the corresponding results on time series. As an application, it is shown how one can test whether a tree has the Markov property or, more generally, estimate its memory.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/ryabko17b.html
Universality of Bayesian mixture predictors
The problem is that of sequential probability forecasting for discrete-valued time series. The data is generated by an unknown probability distribution over the space of all one-way infinite sequences. It is known that this measure belongs to a given set $\mathcal{C}$, but the latter is completely arbitrary (uncountably infinite, without any structure given). The performance is measured by asymptotic average log loss. In this work it is shown that the minimax asymptotic performance is always attainable, and it is attained by a Bayesian mixture over countably many measures from the set $\mathcal{C}$. This was previously only known for the case when the best achievable asymptotic error is 0. The new result can be interpreted as a complete-class theorem for prediction. It also contrasts with previous results that show that in the non-realizable case all Bayesian mixtures may be suboptimal. This leads to a very general conclusion concerning model selection for a problem of sequential inference: it is better to take a model large enough to make sure it includes the process that generates the data, even if it entails positive asymptotic average loss, for otherwise any combination of predictors in the model class may be useless.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/ryabko17a.html
Minimax rates for cost-sensitive learning on manifolds with approximate nearest neighbours
We study the approximate nearest neighbour method for cost-sensitive classification on low-dimensional manifolds embedded within a high-dimensional feature space. We determine the minimax learning rates for distributions on a smooth manifold, in a cost-sensitive setting. This generalises a classic result of Audibert and Tsybakov. Building upon recent work of Chaudhuri and Dasgupta we prove that these minimax rates are attained by the approximate nearest neighbour algorithm, where neighbours are computed in a randomly projected low-dimensional space. In addition, we give a bound on the number of dimensions required for the projection which depends solely upon the reach and dimension of the manifold, combined with the regularity of the marginal.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/reeve17a.html
Soft-Bayes: Prod for Mixtures of Experts with Log-Loss
We consider prediction with expert advice under the log-loss with the goal of deriving efficient and robust algorithms. We argue that existing algorithms such as exponentiated gradient, online gradient descent and online Newton step do not adequately satisfy both requirements. Our main contribution is an analysis of the Prod algorithm that is robust to any data sequence and runs in linear time relative to the number of experts in each round. Despite the unbounded nature of the log-loss, we derive a bound that is independent of the largest loss and of the largest gradient, and depends only on the number of experts and the time horizon. Furthermore, we give a Bayesian interpretation of Prod and adapt the algorithm to derive a tracking regret.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/orseau17a.html
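To make the "linear time per round" claim in the Soft-Bayes entry above concrete, here is a minimal sketch of one Prod-style multiplicative update for a mixture of experts under log-loss. The exact update form, the function name `soft_bayes_round`, and the learning rate `eta=0.3` are assumptions made for illustration; the abstract does not spell them out, so treat this as one plausible instance rather than the paper's algorithm.

```python
from math import log


def soft_bayes_round(weights, expert_probs, eta):
    """One Prod-style update under log-loss (assumed form, not taken from the abstract).

    weights      -- current mixture weights, non-negative and summing to 1
    expert_probs -- probability each expert assigned to the observed outcome
    eta          -- learning rate in (0, 1); the value used below is an assumption
    Returns (log-loss of the mixture this round, updated weights).
    """
    # Probability the weighted mixture assigned to the observed outcome.
    mixture = sum(w * p for w, p in zip(weights, expert_probs))
    loss = -log(mixture)
    # Multiplicative update: a weight grows when its expert beat the mixture.
    # The weights still sum to 1 afterwards, so no renormalisation is needed.
    new_weights = [w * (1.0 - eta + eta * p / mixture)
                   for w, p in zip(weights, expert_probs)]
    return loss, new_weights


# Three experts; the first is consistently the most accurate.
weights = [1 / 3, 1 / 3, 1 / 3]
for probs in ([0.9, 0.5, 0.1], [0.8, 0.5, 0.2], [0.7, 0.5, 0.3]):
    loss, weights = soft_bayes_round(weights, probs, eta=0.3)
```

Each round touches every expert exactly once, which is where the linear per-round cost comes from in this sketch.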
Collaborative Clustering: Sample Complexity and Efficient Algorithms
We study the problem of collaborative clustering. This problem is concerned with a set of items grouped into clusters that we wish to recover from ratings provided by users. The latter are also clustered, and each user rates a random but typically small number of items. The observed ratings are random variables whose distributions depend on the item and user clusters only. Unlike for collaborative filtering problems where one needs to recover both user and item clusters, here we only wish to classify items. The number of items rated by a user can be so small that estimating user clusters may be hopeless anyway. For the collaborative clustering problem, we derive fundamental performance limits satisfied by any algorithm. Specifically, we identify the number of ratings needed to guarantee the existence of an algorithm recovering the clusters with a prescribed level of accuracy. We also propose SplitSpec, an algorithm whose performance matches these fundamental performance limits order-wise. In turn, SplitSpec is able to exploit, as much as this is possible, the users’ structure to improve the item cluster estimates.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/ok17a.html
A minimax and asymptotically optimal algorithm for stochastic bandits
We propose the $\text{kl-UCB}^{++}$ algorithm for regret minimization in stochastic bandit models with exponential families of distributions. We prove that it is simultaneously asymptotically optimal (in the sense of Lai and Robbins' lower bound) and minimax optimal. This is the first algorithm proved to enjoy these two properties at the same time. This work thus merges two different lines of research with simple and clear proofs.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/m%C3%A9nard17a.html
Efficient tracking of a growing number of experts
We consider a variation on the problem of prediction with expert advice, where new forecasters that were unknown until then may appear at each round. As often in prediction with expert advice, designing an algorithm that achieves near-optimal regret guarantees is straightforward, using aggregation of experts. However, when the comparison class is sufficiently rich, for instance when the best expert and the set of experts itself change over time, such strategies naively require maintaining a prohibitive number of weights (typically exponential in the time horizon). By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial. We consider three increasingly challenging objectives (simple regret, shifting regret and sparse shifting regret) that extend existing notions defined for a fixed expert ensemble; in each case, we design strategies that achieve tight regret bounds, adaptive to the parameters of the comparison class, while being computationally inexpensive. Moreover, our algorithms are anytime, agnostic to the number of incoming experts and completely parameter-free. Such remarkable results are made possible thanks to two simple but highly effective recipes: first, the "abstention trick" that comes from the specialist framework and makes it possible to handle the least challenging notions of regret, but is limited when addressing more sophisticated objectives. Second, the "muting trick" that we introduce to give more flexibility. We show how to combine these two tricks in order to handle the most challenging class of comparison strategies.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/mourtada17a.html
An efficient query learning algorithm for zero-suppressed binary decision diagrams
A ZDD is a directed acyclic graph that represents a family of sets over a fixed universe set. In this paper, we propose an algorithm that learns zero-suppressed binary decision diagrams (ZDDs) using membership and equivalence queries. If the target ZDD has $n$ nodes and the cardinality of the universe is $m$, our algorithm uses $n$ equivalence queries and at most $n(\lfloor \log m \rfloor + 4n)$ membership queries to learn the target ZDD.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/mizumoto17a.html
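The query bound quoted in the ZDD entry above is easy to evaluate for concrete sizes. The helper below is only a worked instance of the formula $n(\lfloor \log m \rfloor + 4n)$; the function name `zdd_query_budget` is hypothetical, and taking the logarithm to base 2 is an assumption, since the abstract does not state the base.

```python
def zdd_query_budget(n, m):
    """Evaluate the query bounds stated in the abstract.

    n -- number of nodes in the target ZDD
    m -- cardinality of the universe
    Returns (equivalence queries, upper bound on membership queries).
    Assumption: the logarithm is base 2.
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive integers")
    floor_log2_m = m.bit_length() - 1  # exact floor(log2(m)) for integers
    equivalence = n
    membership = n * (floor_log2_m + 4 * n)
    return equivalence, membership


# A target ZDD with 10 nodes over a universe of 1000 elements.
equivalence, membership = zdd_query_budget(10, 1000)
```

Under the base-2 assumption, this example gives 10 equivalence queries and at most 10 * (9 + 40) = 490 membership queries.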
Boundary Crossing for General Exponential Families
We consider parametric exponential families of dimension $K$ on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$. Formally, our result is a concentration inequality that bounds the probability that $\mathcal{B}^\psi(\hat \theta_n,\theta^\star)\geq f(t/n)/n$, where $\theta^\star$ is the parameter of an unknown target distribution, $\hat \theta_n$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $\mathcal{B}^\psi$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables the analysis of the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when $K=1$, while we provide results for arbitrary finite dimension $K$, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/maillard17a.html
Specifying a positive threshold function via extremal points
An extremal point of a positive threshold Boolean function $f$ is either a maximal zero or a minimal one. It is known that if $f$ depends on all its variables, then the set of its extremal points completely specifies $f$ within the universe of threshold functions. However, in some cases, $f$ can be specified by a smaller set. The minimum number of points in such a set is the specification number of $f$. Hu (1965) showed that the specification number of a threshold function of $n$ variables is at least $n+1$. Anthony et al. (1995) proved that this bound is attained for nested functions and conjectured that for all other threshold functions the specification number is strictly greater than $n+1$. In the present paper, we resolve this conjecture negatively by exhibiting threshold Boolean functions of $n$ variables, which are non-nested and for which the specification number is $n+1$. On the other hand, we show that the set of extremal points satisfies the statement of the conjecture, i.e. a positive threshold Boolean function depending on all its $n$ variables has $n+1$ extremal points if and only if it is nested. To prove this, we reveal an underlying structure of the set of extremal points.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/lozin17a.html
New bounds on the price of bandit feedback for mistake-bounded online multiclass learning
This paper is about two generalizations of the mistake bound model to online multiclass classification. In the standard model, the learner receives the correct classification at the end of each round, and in the bandit model, the learner only finds out whether its prediction was correct or not. For a set $F$ of multiclass classifiers, let $\mathrm{opt}_{\mathrm{std}}(F)$ and $\mathrm{opt}_{\mathrm{bandit}}(F)$ be the optimal bounds for learning $F$ according to these two models. We show that an $$ \mathrm{opt}_{\mathrm{bandit}}(F) \leq (1 + o(1)) (|Y| \ln |Y|) \mathrm{opt}_{\mathrm{std}}(F) $$ bound is the best possible up to the leading constant, closing a $\Theta(\log |Y|)$ factor gap.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/long17a.html
Normal Forms in Semantic Language Identification
We consider language learning in the limit from text where all learning restrictions are semantic, that is, where any conjecture may be replaced by a semantically equivalent conjecture. For different such learning criteria, starting with the well-known $\mathbf{Txt}\mathbf{G}\mathbf{Bc}$-learning, we consider three different normal forms: strongly locking learning, consistent learning and (partially) set-driven learning. These normal forms support and simplify proofs and give insight into what behaviors are necessary for successful learning (for example when consistency in conservative learning implies cautiousness and strong decisiveness).
We show that strongly locking learning can be assumed for partially set-driven learners, even when learning restrictions apply. We give a very general proof relying only on a natural property of the learning restriction, namely, allowing for simulation on equivalent text. Furthermore, when no restrictions apply, the converse is also true: every strongly locking learner can be made partially set-driven. For several semantic learning criteria we show that learning can be done consistently. Finally, we deduce for which learning restrictions partial set-drivenness and set-drivenness coincide, including a general statement about classes of infinite languages. The latter again relies on a simulation argument.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/k%C3%B6tzing17a.html
Scale-Invariant Unconstrained Online Learning
We consider a variant of online convex optimization in which both the instances (input vectors) and the comparator (weight vector) are unconstrained. We exploit a natural scale invariance symmetry in our unconstrained setting: the predictions of the optimal comparator are invariant under any linear transformation of the instances. Our goal is to design online algorithms which also enjoy this property, i.e. are scale-invariant. We start with the case of coordinate-wise invariance, in which the individual coordinates (features) can be arbitrarily rescaled. We give an algorithm which achieves an essentially optimal regret bound in this setup, expressed by means of a coordinate-wise scale-invariant norm of the comparator. We then study general invariance with respect to arbitrary linear transformations. We first give a negative result, showing that no algorithm can achieve a meaningful bound in terms of a scale-invariant norm of the comparator in the worst case. Next, we complement this result with a positive one, providing an algorithm which "almost" achieves the desired bound, incurring only a logarithmic overhead in terms of the norm of the instances.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/kot%C5%82owski17a.html
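The invariance described in the entry above can be checked in one line for linear predictions. The notation here is an assumption made for illustration: $x_t$ is an instance, $u$ a comparator, and $A$ an invertible matrix (invertibility is assumed so that the comparator can be transformed back).

```latex
% Transform the instances by A and the comparator by the inverse transpose of A:
%   x_t \mapsto A x_t, \qquad u \mapsto A^{-\top} u .
% The prediction is unchanged:
\[
  (A^{-\top} u)^{\top} (A x_t) \;=\; u^{\top} A^{-1} A \, x_t \;=\; u^{\top} x_t .
\]
% Coordinate-wise invariance is the special case A = \mathrm{diag}(a_1, \dots, a_d)
% with every a_i \neq 0: feature i is rescaled by a_i and weight i by 1/a_i.
```

Since every comparator's predictions are reproduced exactly after the transformation, the best achievable loss is the same before and after, which is the symmetry a scale-invariant algorithm is asked to share.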
On Compressive Ensemble Induced Regularisation: How Close is the Finite Ensemble Precision Matrix to the Infinite Ensemble?
Averaging ensembles of randomly oriented low-dimensional projections of a singular covariance represents a novel and attractive means to obtain a well-conditioned inverse, which only needs access to random projections of the data. However, theoretical analyses so far have only been done at convergence, implying good properties for 'large-enough' ensembles. But how large is 'large enough'? Here we bound the expected difference in spectral norm between the finite ensemble precision matrix and the infinite ensemble, and based on this we give an estimate of the required ensemble size to guarantee the approximation error of the finite ensemble is below a given tolerance. Under mild assumptions, we find that for any given tolerance, the ensemble only needs to grow linearly in the original data dimension. A technical ingredient of our analysis is to upper bound the spectral norm of a matrix-variate T, which we then employ in conjunction with specific results from random matrix theory regarding the estimation of the covariance of random matrices.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/kab%C3%A1n17a.html
A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds
Recently, much work has been done on extending the scope of online learning and incremental stochastic optimization algorithms. In this paper we contribute to this effort in two ways: First, based on a new regret decomposition and a generalization of Bregman divergences, we provide a self-contained, modular analysis of the two workhorses of online learning: (general) adaptive versions of Mirror Descent (MD) and the Follow-the-Regularized-Leader (FTRL) algorithms. The analysis is done with extra care so as not to introduce assumptions not needed in the proofs, and it allows one to combine, in a straightforward way, different algorithmic ideas (e.g., adaptivity, optimism, implicit updates) and learning settings (e.g., strongly convex or composite objectives). This way we are able to reprove, extend and refine a large body of the literature, while keeping the proofs concise. The second contribution is a byproduct of this careful analysis: We present algorithms with improved variational bounds for smooth, composite objectives, including a new family of optimistic MD algorithms with only one projection step per round. Furthermore, we provide a simple extension of adaptive regret bounds to practically relevant non-convex problem settings with essentially no extra effort.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/joulani17a.html
Graph Verification with a Betweenness Oracle
In this paper, we examine the query complexity of verifying a hidden graph $G$ with a betweenness oracle. Let $G=(V,E)$ be a hidden graph and $\hat{G}=(V,\hat{E})$ be a known graph. $V$ and $\hat{E}$ are known and $E$ is not known. The graphs are connected, unweighted and have bounded maximum degree $\Delta$. The task of the graph verification problem is to verify that $E=\hat{E}$. We have access to $G$ through a black-box betweenness oracle. A betweenness oracle returns whether a vertex lies along a shortest path between two other vertices. The betweenness oracle nicely captures many real-world problems. We prove that graph verification can be done using $n^{1+o(1)}$ betweenness queries. Surprisingly, this matches the state of the art for the graph verification problem with the much stronger distance oracle. We also prove that graph verification requires $\Omega(n)$ betweenness queries -- a matching lower bound.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/janardhanan17a.html
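To illustrate what a betweenness oracle answers in the graph verification entry above, the sketch below simulates one for a small graph whose edges are known. In the paper's setting the graph is hidden and the oracle is a black box, so this simulation is only for intuition. The names `bfs_distances` and `make_betweenness_oracle` are hypothetical, and treating an endpoint as lying "between" is an assumed convention.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in a connected, unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def make_betweenness_oracle(adj):
    """Build between(u, v, w): does v lie on some shortest u-w path?"""
    dist = {u: bfs_distances(adj, u) for u in adj}

    def between(u, v, w):
        # v is on a shortest u-w path exactly when the detour through v is free.
        return dist[u][v] + dist[v][w] == dist[u][w]

    return between


# A 4-cycle 0-1-2-3-0, given as adjacency lists.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
between = make_betweenness_oracle(cycle)
on_path = between(0, 1, 2)    # 1 lies on the shortest path 0-1-2
off_path = between(0, 2, 1)   # 2 is not on the shortest path 0-1
```

The oracle reveals only a yes/no answer per query, which is why it is weaker than a distance oracle that returns the distances themselves.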
Automatic Learning from Repetitive Texts
We study the connections between the learnability of automatic families of languages and the types of text used to present them to a learner. More precisely, we study how restrictions on the number of times that a correct datum appears in a text influence what classes of languages are automatically learnable. We show that an automatic family of languages is automatically learnable from fat text iff it is automatically learnable from thick text iff it is verifiable from balanced text iff it satisfies Angluin's tell-tale condition. Furthermore, many automatic families are automatically learnable from exponential text. We also study the relationship between automatic learnability and verifiability and show that all automatic families are automatically partially verifiable from exponential text and automatically learnable from thick text.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/h%C3%B6lzl17a.html
Structured Best Arm Identification with Fixed Confidence
We study the problem of identifying the best action among a set of possible options when the value of each action is given by a mapping from a number of noisy micro-observables in the so-called fixed confidence setting. Our main motivation is the application to minimax game search, which has been a major topic of interest in artificial intelligence. In this paper we introduce an abstract setting to clearly describe the essential properties of the problem. While previous work only considered a two-move-deep game tree search problem, our abstract setting can be applied to general minimax games where the depth can be non-uniform and arbitrary, and transpositions are allowed. We introduce a new algorithm (LUCB-micro) for the abstract setting, and give its lower and upper sample complexity results. Our bounds recover some previous results, achieved in more limited settings, and also shed further light on how the structure of minimax problems influences sample complexity.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/huang17a.html
Algorithmic Learning Theory (ALT) 2017: Preface
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/hanneke17a.html
Parameter identification in Markov chain choice models
This work studies the parameter identification problem for the Markov chain choice model of Blanchet, Gallego, and Goyal used in assortment planning. In this model, the product selected by a customer is determined by a Markov chain over the products, where the products in the offered assortment are absorbing states. The underlying parameters of the model were previously shown to be identifiable from the choice probabilities for the all-products assortment, together with choice probabilities for assortments of all-but-one products. Obtaining and estimating choice probabilities for such large assortments is not desirable in many settings. The main result of this work is that the parameters may be identified from assortments of sizes two and three, regardless of the total number of products. The result is obtained via a simple and efficient parameter recovery algorithm.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/gupta17a.html
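The forward direction of the Markov chain choice model entry above (from parameters to choice probabilities) can be sketched as an absorption computation. This is not the paper's recovery algorithm, which goes the other way. The names `choice_probabilities`, `arrival`, and `transition`, the use of state 0 as an always-absorbing no-purchase option, and the numbers in the example are all assumptions made for illustration.

```python
def choice_probabilities(arrival, transition, assortment, sweeps=10_000, tol=1e-12):
    """Absorption probabilities of the chain for a given assortment.

    arrival    -- arrival[i] is the probability the customer starts at state i
    transition -- transition[i][j] is the probability of moving from i to j
    assortment -- set of offered products (their states are absorbing)
    State 0 is an assumed no-purchase option and is always absorbing.
    Returns a list whose entry j is the probability of ending at state j.
    """
    n = len(arrival)
    absorbing = set(assortment) | {0}
    absorbed = [0.0] * n
    mass = list(arrival)  # probability mass that has not been absorbed yet
    for _ in range(sweeps):
        nxt = [0.0] * n
        for i, p in enumerate(mass):
            if p == 0.0:
                continue
            if i in absorbing:
                absorbed[i] += p  # customer stops here
            else:
                for j, q in enumerate(transition[i]):
                    nxt[j] += p * q  # customer moves on to another state
        mass = nxt
        if sum(mass) < tol:
            break
    return absorbed


# Two products (states 1 and 2) plus the no-purchase state 0.
arrival = [0.1, 0.5, 0.4]
transition = [
    [1.0, 0.0, 0.0],  # state 0 never moves (row unused because 0 is absorbing)
    [0.2, 0.0, 0.8],  # an unavailable product 1 redirects demand
    [0.5, 0.5, 0.0],  # an unavailable product 2 redirects demand
]
only_product_1 = choice_probabilities(arrival, transition, {1})
```

Offering only product 1 in this toy instance sends half of product 2's arriving demand to product 1 and the other half to no purchase.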
Learning MSO-definable hypotheses on strings
We study classification problems over string data for hypotheses specified by formulas of monadic second-order logic MSO. The goal is to design learning algorithms that run in time polynomial in the size of the training set, independently of or at least sublinear in the size of the whole data set. We prove negative as well as positive results. If the data set is an unprocessed string to which our algorithms have local access, then learning in sublinear time is impossible even for hypotheses definable in a small fragment of first-order logic. If we allow for a linear time pre-processing of the string data to build an index data structure, then learning of MSO-definable hypotheses is possible in time polynomial in the size of the training set, independently of the size of the whole data set.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/grohe17a.html
Preference-based Teaching of Unions of Geometric Objects
This paper studies exact learning of unions of non-discretized geometric concepts in the model of preference-based teaching. In particular, it focuses on upper and lower bounds of the corresponding sample complexity parameter, the preference-based teaching dimension (PBTD), when learning disjoint unions of a bounded number of geometric concepts of various types -- for instance balls, axis-aligned cubes, or axis-aligned boxes -- in arbitrary dimensions. It is shown that the PBTD of disjoint unions of some such types of concepts grows linearly with the number of concepts in the union, independent of the dimensionality. Teaching the union of potentially overlapping objects turns out to be more involved and is hence considered here only for unions of up to two objects.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/gao17a.html
Adaptive Submodularity with Varying Query Sets: An Application to Active Multi-label Learning
Adaptive submodular optimization, where a sequence of items is selected adaptively to optimize a submodular function, has been found to have many applications from sensor placement to active learning. In the current paper, we extend this work to the setting of multiple queries at each time step, where the set of available queries is randomly constrained. A primary contribution of this paper is to prove the first near-optimal approximation bound for a greedy policy in this setting. A natural application of this framework is to the crowd-sourced active learning problem, where the set of available experts and examples might vary randomly. We instantiate the new framework for multi-label learning and evaluate it in multiple benchmark domains with promising results.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/fern17a.html
Dealing with Range Anxiety in Mean Estimation via Statistical Queries
We give algorithms for estimating the expectation of a given real-valued function $\phi:X\to \mathbb{R}$ on a sample drawn randomly from some unknown distribution $D$ over domain $X$, namely $\mathbf{E}_{\mathbf{x}\sim D}[\phi(\mathbf{x})]$. Our algorithms work in two well-studied models of restricted access to data samples. The first one is the statistical query (SQ) model in which an algorithm has access to an SQ oracle for the input distribution $D$ over $X$ instead of i.i.d. samples from $D$. Given a query function $\phi:X \to [0,1]$, the oracle returns an estimate of $\mathbf{E}_{\mathbf{x}\sim D}[\phi(\mathbf{x})]$ within some tolerance $\tau$. The second is a model in which only a single bit is communicated from each sample. In both of these models the error obtained using a naive implementation would scale polynomially with the range of the random variable $\phi(\mathbf{x})$ (which might even be infinite). In contrast, without restrictions on access to data the expected error scales with the standard deviation of $\phi(\mathbf{x})$. Here we give a simple algorithm whose error scales linearly in the standard deviation of $\phi(\mathbf{x})$ and logarithmically with an upper bound on the second moment of $\phi(\mathbf{x})$.
As corollaries, we obtain algorithms for high dimensional mean estimation and stochastic convex optimization in these models that work in more general settings than previously known solutions.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/feldman17b.html
Tight Bounds on $\ell_1$ Approximation and Learning of Self-Bounding Functions
We study the complexity of learning and approximation of self-bounding functions over the uniform distribution on the Boolean hypercube $\{0,1\}^n$. Informally, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is self-bounding if for every $x \in \{0,1\}^n$, $f(x)$ upper bounds the sum of all the $n$ marginal decreases in the value of the function at $x$. Self-bounding functions include such well-known classes of functions as submodular and fractionally-subadditive (XOS) functions. They were introduced by Boucheron et al. in the context of concentration of measure inequalities. Our main result is a nearly tight $\ell_1$-approximation of self-bounding functions by low-degree juntas. Specifically, all self-bounding functions can be $\epsilon$-approximated in $\ell_1$ by a polynomial of degree $\tilde{O}(1/\epsilon)$ over $2^{\tilde{O}(1/\epsilon)}$ variables. We show that both the degree and junta-size are optimal up to logarithmic terms. Previous techniques considered stronger $\ell_2$ approximation and proved nearly tight bounds of $\Theta(1/\epsilon^{2})$ on the degree and $2^{\Theta(1/\epsilon^2)}$ on the number of variables. Our bounds rely on the analysis of noise stability of self-bounding functions together with a stronger connection between noise stability and $\ell_1$ approximation by low-degree polynomials. This technique can also be used to get tighter bounds on $\ell_1$ approximation by low-degree polynomials and a faster learning algorithm for halfspaces.
These results lead to improved and in several cases almost tight bounds for PAC and agnostic learning of self-bounding functions relative to the uniform distribution. In particular, assuming hardness of learning juntas, we show that PAC and agnostic learning of self-bounding functions have complexity of $n^{\tilde{\Theta}(1/\epsilon)}$.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/feldman17a.html
The Complexity of Explaining Neural Networks Through (group) Invariants
Ever since the work of Minsky and Papert, it has been thought that neural networks derive their effectiveness by finding representations of the data that are invariant with respect to the task. In other words, the representations eliminate components of the data that vary in a way that is irrelevant. These invariants are naturally expressed with respect to group operations, and thus an understanding of these groups is key to explaining the effectiveness of the neural network. Moreover, a line of work in deep learning has shown that explicit knowledge of group invariants can lead to more effective training results.
In this paper, we investigate the difficulty of discovering anything about these implicit invariants. Unfortunately, our main results are negative: we show that a variety of questions around investigating invariant representations are NP-hard, even in approximate settings. Moreover, these results do not depend on the kind of architecture used: in fact, our results follow as soon as the network architecture is powerful enough to be universal. The key idea behind our results is that if we can find the symmetries of a problem then we can solve it.
Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/ensign17a.html
http://proceedings.mlr.press/v76/ensign17a.htmlPAC Learning Depth-3 $\textrm{AC}^0$ Circuits of Bounded Top FaninAn important and long-standing question in computational learning theory is how to learn $\textrm{AC}^0$ circuits with respect to any distribution (i.e. PAC learning). All previous results either require that the underlying distribution is uniform (Linial et al., 1993), or a simple variant of the uniform distribution, or restrict the depth of the circuits being learned to 1 (Valiant, 1984) or 2 (Klivans and Servedio, 2004). For circuits of depth 3 or more, it is currently unknown how to PAC learn them. \newline In this paper we present an algorithm to PAC learn depth-3 $\textrm{AC}^0$ circuits of bounded top fanin over $(x_1,\cdots,x_n,\overline{x}_1,\cdots,\overline{x}_n)$. Our result is that every depth-3 $\textrm{AC}^0$ circuit of top fanin $K$ can be computed by a polynomial threshold function (PTF) of degree $\widetilde{O}(K\cdot n^{\frac{1}{2}})$, which means that it can be PAC learned in time $2^{\widetilde{O}(K\cdot n^{\frac{1}{2}})}$. In particular, when $K=O(n^{\epsilon_0})$ for any $\epsilon_0<\frac{1}{2}$, the time for learning is sub-exponential. We note that, instead of employing known tools, we use a specific approximation to express such circuits as PTFs, which saves a factor of $\textrm{polylog}(n)$ in the degree of the PTFs.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/ding17a.html
http://proceedings.mlr.press/v76/ding17a.htmlRelative Error Embeddings of the Gaussian Kernel DistanceA reproducing kernel defines an embedding of a data point into an infinite dimensional reproducing kernel Hilbert space (RKHS). The norm in this space describes a distance, which we call the kernel distance. The random Fourier features (of Rahimi and Recht) describe an oblivious approximate mapping into finite dimensional Euclidean space that behaves similarly to the RKHS. We show in this paper that for the Gaussian kernel the Euclidean norm between these mapped features has $(1+\varepsilon)$-relative error with respect to the kernel distance. When there are $n$ data points, we show that $O((1/\varepsilon^2) \log n)$ dimensions of the approximate feature space are sufficient and necessary. Without a bound on $n$, but when the original points lie in $\mathbb{R}^d$ and have diameter bounded by $\mathcal{M}$, then we show that $O((d/\varepsilon^2) \log \mathcal{M})$ dimensions are sufficient, and that this many are required, up to $\log(1/\varepsilon)$ factors. We empirically confirm that relative error is indeed preserved for kernel PCA using these approximate feature maps.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/chen17a.html
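The relationship between the kernel distance and random Fourier features described in the abstract above can be illustrated with a small sketch. It assumes the Gaussian kernel $K(x,y) = \exp(-\|x-y\|^2 / (2\sigma^2))$ and the paired cosine/sine form of the random Fourier feature map; the bandwidth `sigma`, the function names, and the choice of this particular variant are assumptions made for illustration, not details confirmed by the abstract.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_distance(x, y, sigma=1.0):
    """Distance induced by the RKHS norm; K(x, x) = 1 for the Gaussian kernel."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * gaussian_kernel(x, y, sigma)))


def sample_frequencies(dim, num_freqs, sigma=1.0, seed=0):
    """Draw frequency vectors with i.i.d. N(0, 1/sigma^2) coordinates.

    The draw does not look at the data, so the resulting map is oblivious.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_freqs)]


def rff_map(x, freqs):
    """Map x to 2 * len(freqs) features: scaled (cos, sin) of each projection."""
    scale = 1.0 / math.sqrt(len(freqs))
    feats = []
    for w in freqs:
        t = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(scale * math.cos(t))
        feats.append(scale * math.sin(t))
    return feats


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A possible usage: draw `freqs = sample_frequencies(dim=3, num_freqs=2000)`, then compare `euclidean(rff_map(x, freqs), rff_map(y, freqs))` against `kernel_distance(x, y)` for a pair of points. The ratio should be close to 1, and it concentrates further as `num_freqs` grows.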
http://proceedings.mlr.press/v76/chen17a.htmlNon-Adaptive Randomized Algorithm for Group TestingWe study the problem of group testing with a non-adaptive randomized algorithm in the random incidence design (RID) model where each entry in the test is chosen randomly and independently from $\{0,1\}$ with a fixed probability $p$. <br><br> The property that is sufficient and necessary for a unique decoding is the separability of the tests, but unfortunately no linear time algorithm is known for such tests. In order to achieve linear-time decodable tests, the algorithms in the literature use the disjunction property that gives an almost optimal number of tests. <br><br> We define a new property for the tests which we call the semi-disjunction property. We show that there is a linear time decoding for such tests and for $d\to \infty$ the number of tests converges to the number of tests with the separability property. Our analysis shows that, in the RID model, the number of tests in our algorithm is better than the one with the disjunction property even for small $d$.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/bshouty17a.html
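The random incidence design setting in the abstract above can be sketched as follows. The sketch assumes that each entry of the test matrix is 1 with probability `p` (the abstract does not say which value `p` refers to), and it decodes with a simple elimination rule: every item that appears in a negative test is declared non-defective. This is a standard baseline decoder, not the semi-disjunction decoder proposed in the paper, and all function names are hypothetical.

```python
import random


def random_incidence_design(num_tests, num_items, p, rng):
    """Test matrix whose entries are 1 independently with probability p (assumed)."""
    return [[1 if rng.random() < p else 0 for _ in range(num_items)]
            for _ in range(num_tests)]


def run_tests(design, defectives):
    """A test is positive (1) iff it includes at least one defective item."""
    return [int(any(row[j] for j in defectives)) for row in design]


def decode_by_elimination(design, outcomes):
    """Baseline decoder: discard every item that occurs in a negative test.

    The returned set always contains the true defectives; it equals them
    when the design has enough tests to eliminate every other item.
    Assumes the design has at least one test.
    """
    num_items = len(design[0])
    candidate = [True] * num_items
    for row, outcome in zip(design, outcomes):
        if outcome == 0:
            for j, included in enumerate(row):
                if included:
                    candidate[j] = False
    return {j for j in range(num_items) if candidate[j]}
```

For example, with `rng = random.Random(0)`, one can build `design = random_incidence_design(60, 30, 0.2, rng)`, compute `outcomes = run_tests(design, {3, 17})`, and pass both to `decode_by_elimination`. The design is fixed before any outcome is seen, which is what makes the scheme non-adaptive.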
http://proceedings.mlr.press/v76/bshouty17a.htmlErasing Pattern Languages Distinguishable by a Finite Number of StringsPattern languages have been an object of study in various subfields of computer science for decades. This paper introduces and studies a decision problem on patterns called the <i>finite distinguishability</i> problem: given a pattern $\pi$, are there finite sets $T^+$ and $T^-$ of strings such that the only pattern language containing all strings in $T^+$ and none of the strings in $T^-$ is the language generated by $\pi$? This problem is related to the complexity of teacher-directed learning, as studied in computational learning theory, as well as to the long-standing open question whether the equivalence of two patterns is decidable. We show that finite distinguishability is decidable if the underlying alphabet is of size other than $2$ or $3$, and provide a number of related results, such as (i) partial solutions for alphabet sizes $2$ and $3$, and (ii) decidability proofs for variants of the problem for special subclasses of patterns, namely, regular, 1-variable, and non-cross patterns. For the same subclasses, we further determine the values of two complexity parameters in teacher-directed learning, namely the <i>teaching dimension</i> and the <i>recursive teaching dimension</i>.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/bayeh17a.html
http://proceedings.mlr.press/v76/bayeh17a.htmlLifelong Learning in Costly Feature SpacesAn important long-term goal in machine learning systems is to build learning agents that, like humans, can learn many tasks over their lifetime, and moreover use information from these tasks to improve their ability to do so efficiently. In this work, our goal is to provide new theoretical insights into the potential of this paradigm. In particular, we propose a lifelong learning framework that adheres to a novel notion of resource efficiency that is critical in many real-world domains where feature evaluations are costly. That is, our learner aims to reuse information from previously learned related tasks to learn future tasks in a <i>feature-efficient</i> manner. Furthermore, we consider novel combinatorial ways in which learning tasks can relate. Specifically, we design lifelong learning algorithms for two structurally different and widely used families of target functions: decision trees/lists and monomials/polynomials. We also provide strong feature-efficiency guarantees for these algorithms; in fact, we show that in order to learn future targets, we need only slightly more feature evaluations per training example than what is needed to predict on an arbitrary example using those targets. We also provide algorithms with guarantees in an agnostic model where not all the targets are related to each other. Finally, we also provide lower bounds on the performance of a lifelong learner in these models, which are in fact tight under some conditions.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/balcan17a.html
http://proceedings.mlr.press/v76/balcan17a.htmlThe Power of Random CounterexamplesLearning a target concept from a finite $n \times m$ concept space requires $\Omega{(n)}$ proper equivalence queries in the worst case. We propose a variation of the usual equivalence query in which the teacher is constrained to choose counterexamples randomly from a known probability distribution on examples. We present and analyze the Max-Min learning algorithm, which identifies an arbitrary target concept in an arbitrary finite $n \times m$ concept space using at most an expected $\log_2{n}$ proper equivalence queries with random counterexamples.Wed, 11 Oct 2017 00:00:00 +0000
http://proceedings.mlr.press/v76/angluin17a.html
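The query protocol in the abstract above, in which the teacher must draw its counterexample at random from a known distribution, can be simulated in a few lines. The sketch represents the concept space as a list of $n$ distinct 0/1 rows of length $m$ and assumes the distribution gives positive probability to every example. The learner here simply queries the first concept that is still consistent with the answers so far; it is a naive placeholder and not the Max-Min algorithm, whose hypothesis selection rule is not described in the abstract. All names are hypothetical.

```python
import random


def random_counterexample(hypothesis, target, probs, rng):
    """Teacher: None if the hypothesis is correct, otherwise an example index
    drawn from `probs` restricted to the examples where the two concepts differ."""
    diff = [j for j in range(len(target)) if hypothesis[j] != target[j]]
    if not diff:
        return None
    weights = [probs[j] for j in diff]
    return rng.choices(diff, weights=weights, k=1)[0]


def learn_with_random_counterexamples(concepts, target_index, probs, rng):
    """Naive proper learner (placeholder, not Max-Min).

    Returns (index of the identified concept, number of equivalence queries).
    """
    target = concepts[target_index]
    candidates = list(range(len(concepts)))
    queries = 0
    while True:
        h = candidates[0]  # proper query: the hypothesis is a concept in the space
        queries += 1
        j = random_counterexample(concepts[h], target, probs, rng)
        if j is None:
            return h, queries
        # The counterexample reveals the target's value on example j.
        label = target[j]
        candidates = [c for c in candidates if concepts[c][j] == label]
```

Each counterexample eliminates at least the queried hypothesis, so this placeholder never needs more than $n$ queries. How to choose the hypothesis so that the expected number of queries drops to the $\log_2{n}$ bound stated in the abstract is what the Max-Min algorithm addresses.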