Proceedings of Machine Learning ResearchProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics
Held in San Diego, California, USA on 09-12 May 2015
Published as Volume 38 by the Proceedings of Machine Learning Research on 21 February 2015.
Volume Edited by:
Guy Lebanon
S. V. N. Vishwanathan
Series Editors:
Neil D. Lawrence
Mark Reid
https://proceedings.mlr.press/v38/
Wed, 08 Feb 2023 10:40:19 +0000Wed, 08 Feb 2023 10:40:19 +0000Jekyll v3.9.3The Log-Shift Penalty for Adaptive Estimation of Multiple Gaussian Graphical ModelsSparse Gaussian graphical models characterize sparse dependence relationships between random variables in a network. To estimate multiple related Gaussian graphical models on the same set of variables, we formulate a hierarchical model, which leads to an optimization problem with a nonconvex log-shift penalty function. We show that under mild conditions the optimization problem is convex despite the inclusion of a nonconvex penalty, and derive an efficient optimization algorithm. Experiments on both synthetic and real data show that the proposed method is able to achieve good selection and estimation performance simultaneously, because the nonconvexity of the log-shift penalty allows for weak signals to be thresholded to zero without excessive shrinkage on the strong signals.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/zhu15.html
https://proceedings.mlr.press/v38/zhu15.htmlPower-Law Graph CutsAlgorithms based on spectral graph cut objectives such as normalized cuts, ratio cuts and ratio association have become popular in recent years because they are widely applicable and simple to implement via standard eigenvector computations. Despite strong performance for a number of clustering tasks, spectral graph cut algorithms still suffer from several limitations: first, they require the number of clusters to be known in advance, but this information is often unknown a priori; second, they tend to produce clusters with uniform sizes. In some cases, the true clusters exhibit a known size distribution; in image segmentation, for instance, human- segmented images tend to yield segment sizes that follow a power-law distribution. In this paper, we propose a general framework of power-law graph cut algorithms that produce clusters whose sizes are power-law distributed, and also does not fix the number of clusters upfront. To achieve our goals, we treat the Pitman-Yor exchangeable partition probability function (EPPF) as a regularizer to graph cut objectives. Because the result- ing objectives cannot be solved by relaxing via eigenvectors, we derive a simple iterative algorithm to locally optimize the objectives. Moreover, we show that our proposed algorithm can be viewed as performing MAP inference on a particular Pitman-Yor mixture model. Our experiments on various data sets show the effectiveness of our algorithms.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/zhou15b.html
https://proceedings.mlr.press/v38/zhou15b.htmlInfinite Edge Partition Models for Overlapping Community Detection and Link PredictionA hierarchical gamma process infinite edge partition model is proposed to factorize the binary adjacency matrix of an unweighted undirected relational network under a Bernoulli-Poisson link. The model describes both homophily and stochastic equivalence, and is scalable to big sparse networks by focusing its computation on pairs of linked nodes. It can not only discover overlapping communities and inter-community interactions, but also predict missing edges. A simplified version omitting inter-community interactions is also provided and we reveal its interesting connections to existing models. The number of communities is automatically inferred in a nonparametric Bayesian manner, and efficient inference via Gibbs sampling is derived using novel data augmentation techniques. Experimental results on four real networks demonstrate the models’ scalability and state-of-the-art performance.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/zhou15a.html
https://proceedings.mlr.press/v38/zhou15a.htmlScalable Nonparametric Multiway Data AnalysisMultiway data analysis deals with multiway arrays, i.e., tensors, and the goal is twofold: predicting missing entries by modeling the interactions between array elements and discovering hidden patterns, such as clusters or communities in each mode. Despite the success of existing tensor factorization approaches, they are either unable to capture nonlinear interactions, or computationally expensive to handle massive data. In addition, most of the existing methods lack a principled way to discover latent clusters, which is important for better understanding of the data. To address these issues, we propose a scalable nonparametric tensor decomposition model. It employs Dirichlet process mixture (DPM) prior to model the latent clusters; it uses local Gaussian processes (GPs) to capture nonlinear relationships and to improve scalability. An efficient online variational Bayes Expectation-Maximization algorithm is proposed to learn the model. Experiments on both synthetic and real-world data show that the proposed model is able to discover latent clusters with higher prediction accuracy than competitive methods. Furthermore, the proposed model obtains significantly better predictive performance than the state-of-the-art large scale tensor decomposition algorithm, GigaTensor, on two large datasets with billions of entries.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/zhe15.html
https://proceedings.mlr.press/v38/zhe15.htmlA Simple Homotopy Algorithm for Compressive SensingIn this paper, we consider the problem of recovering the s largest elements of an arbitrary vector from noisy measurements. Inspired by previous work, we develop an homotopy algorithm which solves the \ell_1-regularized least square problem for a sequence of decreasing values of the regularization parameter. Compared to the previous method, our algorithm is more efficient in the sense it only updates the solution once for each intermediate problem, and more practical in the sense it has a simple stopping criterion by checking the sparsity of the intermediate solution. Theoretical analysis reveals that our method enjoys a linear convergence rate in reducing the recovery error. Furthermore, our guarantee for recovering the top s elements of the target vector is tighter than previous results, and that for recovering the target vector itself matches the state of the art in compressive sensing.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/zhang15.html
https://proceedings.mlr.press/v38/zhang15.htmlMinimizing Nonconvex Non-Separable FunctionsRegularization has played a key role in deriving sensible estimators in high dimensional statistical inference. A substantial amount of recent works has argued for nonconvex regularizers in favor of their superior theoretical properties and excellent practical performances. In a different but analogous vein, nonconvex loss functions are promoted because of their robustness against "outliers". However, these nonconvex formulations are computationally more challenging, especially in the presence of nonsmoothness and non-separability. To address this issue, we propose a new proximal gradient meta-algorithm by rigorously extending the proximal average to the nonconvex setting. We formally prove its nice convergence properties, and illustrate its effectiveness on two applications: multi-task graph-guided fused lasso and robust support vector machines. Experiments demonstrate that our method compares favorably against other alternatives.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/yu15.html
https://proceedings.mlr.press/v38/yu15.htmlA la Carte – Learning Fast KernelsKernel methods have great promise for learning rich statistical representations of large modern datasets. However, compared to neural networks, kernel methods have been perceived as lacking in scalability and flexibility. We introduce a family of fast, flexible, general purpose, and lightly parametrized kernel learning methods, derived from Fastfood basis function expansions. We provide mechanisms to learn the properties of groups of spectral frequencies in these expansions, which require only O(m log d) time and O(m) memory, for m basis functions and d input dimensions. We show that the proposed methods can learn a wide class of kernels, outperforming the alternatives in accuracy, speed, and memory consumption.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/yang15b.html
https://proceedings.mlr.press/v38/yang15b.htmlMajorization-Minimization for Manifold EmbeddingNonlinear dimensionality reduction by manifold embedding has become a popular and powerful approach both for visualization and as preprocessing for predictive tasks, but more efficient optimization algorithms are still crucially needed. Majorization-Minimization (MM) is a promising approach that monotonically decreases the cost function, but it remains unknown how to tightly majorize the manifold embedding objective functions such that the resulting MM algorithms are efficient and robust. We propose a new MM procedure that yields fast MM algorithms for a wide variety of manifold embedding problems. In our majorization step, two parts of the cost function are respectively upper bounded by quadratic and Lipschitz surrogates, and the resulting upper bound can be minimized in closed form. For cost functions amenable to such QL-majorization, the MM yields monotonic improvement and is efficient: in experiments the newly developed MM algorithms outperform five state-of-the-art optimization approaches in manifold embedding tasks.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/yang15a.html
https://proceedings.mlr.press/v38/yang15a.htmlStochastic Block Transition Models for Dynamic NetworksThere has been great interest in recent years on statistical models for dynamic networks. In this paper, I propose a stochastic block transition model (SBTM) for dynamic networks that is inspired by the well-known stochastic block model (SBM) for static networks and previous dynamic extensions of the SBM. Unlike most existing dynamic network models, it does not make a hidden Markov assumption on the edge-level dynamics, allowing the presence or absence of edges to directly influence future edge probabilities while retaining the interpretability of the SBM. I derive an approximate inference procedure for the SBTM and demonstrate that it is significantly better at reproducing durations of edges in real social network data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/xu15.html
https://proceedings.mlr.press/v38/xu15.htmlUnderstanding and Evaluating Sparse Linear Discriminant AnalysisLinear discriminant analysis (LDA) represents a simple yet powerful technique for partitioning a p-dimensional feature vector into one of K classes based on a linear projection learned from N labeled observations. However, it is well-established that in the high-dimensional setting (p > N) the underlying projection estimator degenerates. Moreover, any linear discriminate function involving a large number of features may be difficult to interpret. To ameliorate these issues, two general categories of sparse LDA modifications have been proposed, both to reduce the number of active features and to stabilize the resulting projections. The first, based on optimal scoring, is more straightforward to implement and analyze but has been heavily criticized for its ambiguous connection with the original LDA formulation. In contrast, a second strategy applies sparse penalty functions directly to the original LDA objective but requires additional heuristic trade-off parameters, has unknown global and local minima properties, and requires a greedy sequential optimization procedure. In all cases the choice of sparse regularizer can be important, but no rigorous guidelines have been provided regarding which penalty might be preferable. Against this backdrop, we winnow down the broad space of candidate sparse LDA algorithms and promote a specific selection based on optimal scoring coupled with a particular, complementary sparse regularizer. This overall process ultimately progresses our understanding of sparse LDA in general, while leading to targeted modifications of existing algorithms that produce superior results in practice on three high-dimensional gene data sets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wu15.html
https://proceedings.mlr.press/v38/wu15.htmlRevisiting the Limits of MAP Inference by MWSS on Perfect GraphsA recent, promising approach to identifying a configuration of a discrete graphical model with highest probability (termed MAP inference) is to reduce the problem to finding a maximum weight stable set (MWSS) in a derived weighted graph, which, if perfect, allows a solution to be found in polynomial time. Weller and Jebara (2013) investigated the class of binary pairwise models where this method may be applied. However, their analysis made a seemingly innocuous assumption which simplifies analysis but led to only a subset of possible reparameterizations being considered. Here we introduce novel techniques and consider all cases, demonstrating that this greatly expands the set of tractable models. We provide a simple, exact characterization of the new, enlarged set and show how such models may be efficiently identified, thus settling the power of the approach on this class.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/weller15.html
https://proceedings.mlr.press/v38/weller15.htmlA Greedy Homotopy Method for Regression with Nonconvex ConstraintsThe goal of this paper is to estimate sparse linear regression models, where for a given partition \mathcalG of input variables, the selected variables are chosen from a \it diverse set of groups in \mathcalG. We consider a novel class of nonconvex constraint functions, and develop RepLasso, a greedy homotopy method that exploits geometrical properties of the constraint functions to build a sequence of suitably adapted convex surrogate problems. We prove that in some situations RepLasso recovers the global minima path of the nonconvex problem. Moreover, even if it does not recover the global minima, we prove that it will often do no worse than the Lasso in terms of (signed) support recovery, while in practice outperforming it. We show empirically that the strategy can also be used to improve over various other Lasso-style algorithms. Finally, a GWAS of ankylosing spondylitis highlights our method’s practical utility.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wauthier15.html
https://proceedings.mlr.press/v38/wauthier15.htmlTrend Filtering on GraphsWe introduce a family of adaptive estimators on graphs, based on penalizing the \ell_1 norm of discrete graph differences. This generalizes the idea of trend filtering (Kim et al., 2009, Tibshirani, 2014) used for univariate nonparametric regression, to graphs. Analogous to the univariate case, graph trend filtering exhibits a level of local adaptivity unmatched by the usual \ell_2-based graph smoothers. It is also defined by a convex minimization problem that is readily solved (e.g., by fast ADMM or Newton algorithms). We demonstrate the merits of graph trend filtering through examples and theory.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wang15d.html
https://proceedings.mlr.press/v38/wang15d.htmlColumn Subset Selection with Missing Data via Active SamplingColumn subset selection of massive data matrices has found numerous applications in real-world data systems. In this paper, we propose and analyze two sampling based algorithms for column subset selection without access to the complete input matrix. To our knowledge, these are the first algorithms for column subset selection with missing data that are provably correct. The proposed methods work for row/column coherent matrices by employing the idea of adaptive sampling. Furthermore, when the input matrix has a noisy low-rank structure, one algorithm enjoys a relative error bound.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wang15c.html
https://proceedings.mlr.press/v38/wang15c.htmlMulti-Manifold Modeling in Non-Euclidean spacesThis paper advocates a novel framework for segmenting a dataset on a Riemannian manifold M into clusters lying around low-dimensional submanifolds of M. Important examples of M, for which the proposed algorithm is computationally efficient, include the sphere, the set of positive definite matrices, and the Grassmannian. The proposed algorithm constructs a data-affinity matrix by thoroughly exploiting the intrinsic geometry and then applies spectral clustering. Local geometry is encoded by sparse coding and directional information of local tangent spaces and geodesics, which is important in resolving intersecting clusters and establishing the theoretical guarantees for a simplified variant of the algorithm. To avoid complication, these guarantees assume that the underlying submanifolds are geodesic. Extensive validation on synthetic and real data demonstrates the resiliency of the proposed method against deviations from the theoretical (geodesic) model as well as its superior performance over state-of-the-art techniques.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wang15b.html
https://proceedings.mlr.press/v38/wang15b.htmlFalling Rule ListsFalling rule lists are classification models consisting of an ordered list of if-then rules, where (i) the order of rules determines which example should be classified by each rule, and (ii) the estimated probability of success decreases monotonically down the list. These kinds of rule lists are inspired by healthcare applications where patients would be stratified into risk sets and the highest at-risk patients should be considered first. We provide a Bayesian framework for learning falling rule lists that does not rely on traditional greedy decision tree learning methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/wang15a.html
https://proceedings.mlr.press/v38/wang15a.htmlMaximally Informative Hierarchical Representations of High-Dimensional DataWe consider a set of probabilistic functions of some input variables as a representation of the inputs. We present bounds on how informative a representation is about input data. We extend these bounds to hierarchical representations so that we can quantify the contribution of each layer towards capturing the information in the original data. The special form of these bounds leads to a simple, bottom-up optimization procedure to construct hierarchical representations that are also maximally informative about the data. This optimization has linear computational complexity and constant sample complexity in the number of variables. These results establish a new approach to unsupervised learning of deep representations that is both principled and practical. We demonstrate the usefulness of the approach on both synthetic and real-world data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/versteeg15.html
https://proceedings.mlr.press/v38/versteeg15.htmlLearning of Non-Parametric Control Policies with High-Dimensional State FeaturesLearning complex control policies from high-dimensional sensory input is a challenge for reinforcement learning algorithms. Kernel methods that approximate values functions or transition models can address this problem. Yet, many current approaches rely on instable greedy maximization. In this paper, we develop a policy search algorithm that integrates robust policy updates and kernel embeddings. Our method can learn non-parametric control policies for infinite horizon continuous MDPs with high-dimensional sensory representations. We show that our method outperforms related approaches, and that our algorithm can learn an underpowered swing-up task task directly from high-dimensional image data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/vanhoof15.html
https://proceedings.mlr.press/v38/vanhoof15.htmlParticle Gibbs with Ancestor Sampling for Probabilistic ProgramsParticle Markov chain Monte Carlo techniques rank among current state-of-the-art methods for probabilistic program inference. A drawback of these techniques is that they rely on importance resampling, which results in degenerate particle trajectories and a low effective sample size for variables sampled early in a program. We here develop a formalism to adapt ancestor resampling, a technique that mitigates particle degeneracy, to the probabilistic programming setting. We present empirical results that demonstrate nontrivial performance gains.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/vandemeent15.html
https://proceedings.mlr.press/v38/vandemeent15.htmlMissing at Random in Graphical ModelsThe notion of missing at random (MAR) plays a central role in the theory underlying current methods for handling missing data. However the standard definition of MAR is difficult to interpret in practice. In this paper, we assume the missing data model is represented as a directed acyclic graph that not only encodes the dependencies among the variables but also explicitly portrays the causal mechanisms responsible for the missingness process. We introduce an intuitively appealing notion of MAR in such graphical models, and establish its relation with the standard MAR and a few versions of MAR used in the literature. We address the question of whether MAR is testable, given that data are corrupted by missingness, by proposing a general method for identifying testable implications imposed by the graphical structure on the observed data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/tian15.html
https://proceedings.mlr.press/v38/tian15.htmlStreaming Variational Inference for Bayesian Nonparametric Mixture ModelsIn theory, Bayesian nonparametric (BNP) models are well suited to streaming data scenarios due to their ability to adapt model complexity based on the amount of data observed. Unfortunately, such benefits have not been fully realized in practice; existing inference algorithms either are not applicable to streaming applications or are not extensible to nonparametric models. For the special case of Dirichlet processes, streaming inference has been considered. However, there is growing interest in more flexible BNP models, in particular building on the class of normalized random measures (NRMs). We work within this general framework and present a streaming variational inference algorithm for NRM mixture models based on assumed density filtering. Extensions to expectation propagation algorithms are possible in the batch data setting. We demonstrate the efficacy of the algorithm on clustering documents in large, streaming text corpora.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/tank15.html
https://proceedings.mlr.press/v38/tank15.htmlPredicting Preference Reversals via Gaussian Process Uncertainty AversionModeling of a product or service’s attractiveness as a function of its own attributes (e.g., price and quality) is one of the foundations in econometric forecasts, which have been provided with an assumption that each human rationally has a consistent preference order among his choice decisions. Yet the preference orders by real humans become irrationally reversed, when the choice set of available options is manipulated. In order to accurately predict choice decisions involving preference reversals, which existing econometric methods have failed to incorporate, the authors introduce a new cognitive choice model whose parameters are efficiently fitted with a global convex optimization algorithm. The proposed model captures each human as a Bayesian decision maker facing a mental conflict between objective evaluation samples and a subjective prior, where the underlying objective evaluation function is rationally independent from contexts while the subjective prior is irrationally determined by each choice set. As the key idea to analytically handle the irrationality and to yield the convex optimization, the Bayesian decision mechanism is implemented as a closed-form Gaussian process regression using similarities among the available options in each context. By explaining the irrational decisions as a consequence of averting uncertainty, the proposed model outperformed the existing econometric models in predicting the irrational choice decisions recorded in real-world datasets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/takahashi15.html
https://proceedings.mlr.press/v38/takahashi15.htmlTwo-stage sampled learning theory on distributionsWe focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only samples from sampled distributions are observable, and one has to build an estimate on similarities computed between sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence issues in high dimensions), and the domain of the distributions to be compact Euclidean. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative to solve the distribution regression problem: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is to prove the consistency of this technique in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). As a special case, we answer a 15-year-old open question: we establish the consistency of the classical set kernel [Haussler, 1999; Gaertner et. al, 2002] in regression, and cover more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/szabo15.html
https://proceedings.mlr.press/v38/szabo15.htmlInferring Block Structure of Graphical Models in Exponential FamiliesLearning the structure of a graphical model is a fundamental problem and it is used extensively to infer the relationship between random variables. In many real world applications, we usually have some prior knowledge about the underlying graph structure, such as degree distribution and block structure. In this paper, we propose a novel generative model for describing the block structure in general exponential families, and optimize it by an Expectation-Maximization(EM) algorithm with variational Bayes. Experimental results show that our method performs well on both synthetic and real data. Further, our method can predict overlapped block structure of a graphical model in general exponential families.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/sun15.html
https://proceedings.mlr.press/v38/sun15.htmlA Dirichlet Process Mixture Model for Spherical DataDirectional data, naturally represented as points on the unit sphere, appear in many applications. However, unlike the case of Euclidean data, flexible mixture models on the sphere that can capture correlations, handle an unknown number of components and extend readily to high-dimensional data have yet to be suggested. For this purpose we propose a Dirichlet process mixture model of Gaussian distributions in distinct tangent spaces (DP-TGMM) to the sphere. Importantly, the formulation of the proposed model allows the extension of recent advances in efficient inference for Bayesian nonparametric models to the spherical domain. Experiments on synthetic data as well as real-world 3D surface normal and 20-dimensional semantic word vector data confirm the expressiveness and applicability of the DP-TGMM.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/straub15.html
https://proceedings.mlr.press/v38/straub15.htmlCalibration of conditional composite likelihood for Bayesian inference on Gibbs random fieldsGibbs random fields play an important role in statistics, however, the resulting likelihood is typically unavailable due to an intractable normalizing constant. Composite likelihoods offer a principled means to construct useful approximations. This paper provides a mean to calibrate the posterior distribution resulting from using a composite likelihood and illustrate its performance in several examples.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/stoehr15.html
https://proceedings.mlr.press/v38/stoehr15.htmlWASP: Scalable Bayes via barycenters of subset posteriorsThe promise of Bayesian methods for big data sets has not fully been realized due to the lack of scalable computational algorithms. For massive data, it is necessary to store and process subsets on different machines in a distributed manner. We propose a simple, general, and highly efficient approach, which first runs a posterior sampling algorithm in parallel on different machines for subsets of a large data set. To combine these subset posteriors, we calculate the Wasserstein barycenter via a highly efficient linear program. The resulting estimate for the Wasserstein posterior (WASP) has an atomic form, facilitating straightforward estimation of posterior summaries of functionals of interest. The WASP approach allows posterior sampling algorithms for smaller data sets to be trivially scaled to huge data. We provide theoretical justification in terms of posterior consistency and algorithm efficiency. Examples are provided in complex settings including Gaussian process regression and nonparametric Bayes mixture models.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/srivastava15.html
https://proceedings.mlr.press/v38/srivastava15.htmlData modeling with the elliptical gamma distributionWe study mixture modeling using the elliptical gamma (EG) distribution, a non-Gaussian distribution that allows heavy and light tail and peak behaviors. We first consider maximum likelihood parameter estimation, a task that turns out to be very challenging: we must handle positive definiteness constraints, and more crucially, we must handle possibly nonconcave log-likelihoods, which makes maximization hard. We overcome these difficulties by developing algorithms based on fixed-point theory; our methods respect the psd constraint, while also efficiently solving the (possibly) nonconcave maximization to global optimality. Subsequently, we focus on mixture modeling using EG distributions: we present a closed-form expression of the KL-divergence between two EG distributions, which we then combine with our ML estimation methods to obtain an efficient split-and-merge expectation maximization algorithm. We illustrate the use of our model and algorithms on a dataset of natural image patches.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/sra15.html
https://proceedings.mlr.press/v38/sra15.htmlLearning from Data with Heterogeneous Noise using SGDWe consider learning from data of variable quality that may be obtained from different heterogeneous sources. Addressing learning from heterogeneous data in its full generality is a challenging problem. In this paper, we adopt instead a model in which data is observed through heterogeneous noise, where the noise level reflects the quality of the data source. We study how to use stochastic gradient algorithms to learn in this model. Our study is motivated by two concrete examples where this problem arises naturally: learning with local differential privacy based on data from multiple sources with different privacy requirements, and learning from data with labels of variable quality. The main contribution of this paper is to identify how heterogeneous noise impacts performance. We show that given two datasets with heterogeneous noise, the order in which to use them in standard SGD depends on the learning rate. We propose a method for changing the learning rate as a function of the heterogeneity, and prove new regret bounds for our method in two cases of interest. Finally, we evaluate the performance of our algorithm on real data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/song15.html
https://proceedings.mlr.press/v38/song15.htmlState Space Methods for Efficient Inference in Student-t Process RegressionThe added flexibility of Student-t processes (TPs) over Gaussian processes (GPs) robustifies inference in outlier-contaminated noisy data. The uncertainties are better accounted for than in GP regression, because the predictive covariances explicitly depend on the training observations. For an entangled noise model, the canonical-form TP regression problem can be solved analytically, but the naive TP and GP solutions share the same cubic computational cost in the number of training observations. We show how a large class of temporal TP regression models can be reformulated as state space models, and how a forward filtering and backward smoothing recursion can be derived for solving the inference analytically in linear time complexity. This is a novel finding that generalizes the previously known connection between Gaussian process regression and Kalman filtering to more general elliptical processes and non-Gaussian Bayesian filtering. We derive this connection, demonstrate the benefits of the approach with examples, and finally apply the method to empirical data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/solin15.html
https://proceedings.mlr.press/v38/solin15.htmlLearning Where to Sample in Structured PredictionIn structured prediction, most inference algorithms allocate a homogeneous amount of computation to all parts of the output, which can be wasteful when different parts vary widely in terms of difficulty. In this paper, we propose a heterogeneous approach that dynamically allocates computation to the different parts. Given a pre-trained model, we tune its inference algorithm (a sampler) to increase test-time throughput. The inference algorithm is parametrized by a meta-model and trained via reinforcement learning, where actions correspond to sampling candidate parts of the output, and rewards are log-likelihood improvements. The meta-model is based on a set of domain-general meta-features capturing the progress of the sampler. We test our approach on five datasets and show that it attains the same accuracy as Gibbs sampling but is 2 to 5 times faster.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/shi15.html
https://proceedings.mlr.press/v38/shi15.htmlExploiting Symmetries to Construct Efficient MCMC Algorithms With an Application to SLAMThe Metropolis-Hastings (MH) algorithm is a flexible method to generate samples from a target distribution, a key problem in probabilistic inference. In this paper we propose a variation of the MH algorithm based on group moves, where the next state is obtained by first choosing a random transformation of the state space and then applying this transformation to the current state. This adds much-needed flexibility to the "textbook" MH algorithm where all measures involved must be given in terms of densities with respect to a common reference measure. Under mild conditions, our main result extends the acceptance probability formula of the textbook algorithm to MH algorithms with group moves. We work out how the new algorithms can be used to exploit a problem’s natural symmetries and apply the technique to the simultaneous localization and mapping (SLAM) problem, obtaining the first fully rigorous justification of a previous MCMC-based SLAM method. New experimental results comparing our method to existing state-of-the-art specialized methods on a standard range-only SLAM benchmark problem validate the strength of the approach.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/shariff15.html
https://proceedings.mlr.press/v38/shariff15.htmlEstimation from Pairwise Comparisons: Sharp Minimax Bounds with Topology DependenceConsider the problem of identifying the underlying qualities of a set of items based on measuring noisy comparisons between pairs of items. The Bradley-Terry-Luce (BTL) and Thurstone models are the most widely used parametric models for such pairwise comparison data. Working within a standard minimax framework, this paper provides sharp upper and lower bounds on the optimal error in estimating the underlying qualities under the BTL and the Thurstone models. These bounds are are topology-aware, meaning that they change qualitatively depending on the comparison graph induced by the subset of pairs being compared. Thus, in settings where the subset of pairs may be chosen, our results provide some principled guidelines for making this choice. Finally, we compare these error rates to those under cardinal measurement models and show that the error rates in the ordinal and cardinal settings have identical scalings apart from constant pre-factors. We use this result to investigate the relative merits of cardinal and ordinal measurement schemes.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/shah15.html
https://proceedings.mlr.press/v38/shah15.htmlInference of Cause and Effect with Unsupervised Inverse RegressionWe address the problem of causal discovery in the two-variable case given a sample from their joint distribution. The proposed method is based on a known assumption that, if X -> Y (X causes Y), the marginal distribution of the cause, P(X), contains no information about the conditional distribution P(Y|X). Consequently, estimating P(Y|X) from P(X) should not be possible. However, estimating P(X|Y) based on P(Y) may be possible. This paper employs this asymmetry to propose CURE, a causal discovery method which decides upon the causal direction by comparing the accuracy of the estimations of P(Y|X) and P(X|Y). To this end, we propose a method for estimating a conditional from samples of the corresponding marginal, which we call unsupervised inverse GP regression. We evaluate CURE on synthetic and real data. On the latter, our method outperforms existing causal inference methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/sgouritsa15.html
https://proceedings.mlr.press/v38/sgouritsa15.htmlA Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy LabelsMixture proportion estimation (MPE) is a fundamental tool for solving a number of weakly supervised learning problems – supervised learning problems where label information is noisy or missing. Previous work on MPE has established a universally consistent estimator. In this work we establish a rate of convergence for mixture proportion estimation under an appropriate distributional assumption, and argue that this rate of convergence is useful for analyzing weakly supervised learning algorithms that build on MPE. To illustrate this idea, we examine an algorithm for classification in the presence of noisy labels based on surrogate risk minimization, and show that the rate of convergence for MPE enables proof of the algorithm’s consistency. Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/scott15.html
https://proceedings.mlr.press/v38/scott15.htmlSensor Selection for Crowdsensing Dynamical SystemsWe model crowdsensing as the selection of sensors with unknown variance to monitor a large linear dynamical system. To achieve low estimation error, we propose a Thompson sampling approach combining submodular optimization and a scalable online variational inference algorithm to maintain the posterior distribution over the variance. We also consider three alternative parameter estimation algorithms. We illustrate the behavior of our sensor selection algorithms on real traffic data from the city of Dublin. Our online algorithm achieves significantly lower estimation error than sensor selection using a fixed variance value for all sensors.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/schnitzler15.html
https://proceedings.mlr.press/v38/schnitzler15.htmlNon-Uniform Stochastic Average Gradient Method for Training Conditional Random FieldsWe apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the SAGA variant under non-uniform sampling. Our experimental results reveal that our method significantly outperforms existing methods in terms of the training objective, and performs as well or better than optimally-tuned stochastic gradient methods in terms of test error.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/schmidt15.html
https://proceedings.mlr.press/v38/schmidt15.htmlDirect Density-Derivative Estimation and Its Application in KL-Divergence ApproximationEstimation of density derivatives is a versatile tool in statistical data analysis. A naive approach is to first estimate the density and then compute its derivative. However, such a two-step approach does not work well because a good density estimator does not necessarily mean a good density-derivative estimator. In this paper, we give a direct method to approximate the density derivative without estimating the density itself. Our proposed estimator allows analytic and computationally efficient approximation of multi-dimensional high-order density derivatives, with the ability that all hyper-parameters can be chosen objectively by cross-validation. We further show that the proposed density-derivative estimator is useful in improving the accuracy of non-parametric KL-divergence estimation via metric learning. The practical superiority of the proposed method is experimentally demonstrated in change detection and feature selection.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/sasaki15.html
https://proceedings.mlr.press/v38/sasaki15.htmlGamma Processes, Stick-Breaking, and Variational InferenceWhile most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithm on non-negative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors, as well as a direct DP-based construction of the gamma process.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/roychowdhury15.html
https://proceedings.mlr.press/v38/roychowdhury15.htmlLearning Efficient Anomaly Detectors from K-NN GraphsWe propose a non-parametric anomaly detection algorithm for high dimensional data. We score each datapoint by its average K-NN distance, and rank them accordingly. We then train limited complexity models to imitate these scores based on the max-margin learning-to-rank framework. A test-point is declared as an anomaly at α-false alarm level if the predicted score is in the α-percentile. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate α, its decision region converges to the α-percentile minimum volume level set of the unknown underlying density. In addition, we test both the statistical performance and computational efficiency of our algorithm on a number of synthetic and real-data experiments. Our results demonstrate the superiority of our algorithm over existing K-NN based anomaly detection algorithms, with significant computational savings.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/root15.html
https://proceedings.mlr.press/v38/root15.htmlA Scalable Algorithm for Structured Kernel Feature SelectionKernel methods are powerful tools for nonlinear feature representation. Incorporated with structured LASSO, the kernelized structured LASSO is an effective feature selection approach that can preserve the nonlinear input-output relationships as well as the structured sparseness. But as the data dimension increases, the method can quickly become computationally prohibitive. In this paper we propose a stochastic optimization algorithm that can efficiently address this computational problem on account of the redundant kernel representations of the given data. Experiments on simulation data and PET 3D brain image data show that our method can achieve superior accuracy with less computational cost than existing methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/ren15.html
https://proceedings.mlr.press/v38/ren15.htmlOn the High Dimensional Power of a Linear-Time Two Sample Test under Mean-shift AlternativesNonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textitgeneral alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textitmean-shift alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test’s power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and the first analysis of how tests designed for general alternatives perform against easier ones.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/reddi15.html
https://proceedings.mlr.press/v38/reddi15.htmlDeep Exponential FamiliesWe describe deep exponential families (DEFs), a class of latent variable models that are inspired by the hidden structures used in deep neural networks. DEFs capture a hierarchy of dependencies between latent variables, and are easily generalized to many settings through exponential families. We perform inference using recent “black box" variational inference techniques. We then evaluate various DEFs on text and combine multiple DEFs into a model for pairwise recommendation data. In an extensive study, we show that going beyond one layer improves predictions for DEFs. We demonstrate that DEFs find interesting exploratory structure in large data sets, and give better predictive performance than state-of-the-art models.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/ranganath15.html
https://proceedings.mlr.press/v38/ranganath15.htmlRobust sketching for multiple square-root LASSO problemsMany learning tasks, such as cross-validation, parameter search, or leave-one-out analysis, involve multiple instances of similar problems, each instance sharing a large part of learning data with the others. We introduce a robust framework for solving multiple square-root LASSO problems, based on a sketch of the learning data that uses low-rank approximations. Our approach allows a dramatic reduction in computational effort, in effect reducing the number of observations from m (the number of observations to start with) to k (the number of singular values retained in the low-rank model), while not sacrificing—sometimes even improving—the statistical performance. Theoretical analysis, as well as numerical experiments on both synthetic and real data, illustrate the efficiency of the method in large scale applications.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/pham15.html
https://proceedings.mlr.press/v38/pham15.htmlOn Theoretical Properties of Sum-Product NetworksSum-product networks (SPNs) are a promising avenue for probabilistic modeling and have been successfully applied to various tasks. However, some theoretic properties about SPNs are not yet well understood. In this paper we fill some gaps in the theoretic foundation of SPNs. First, we show that the weights of any complete and consistent SPN can be transformed into locally normalized weights without changing the SPN distribution. Second, we show that consistent SPNs cannot model distributions significantly (exponentially) more compactly than decomposable SPNs. As a third contribution, we extend the inference mechanisms known for SPNs with finite states to generalized SPNs with arbitrary input distributions.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/peharz15.html
https://proceedings.mlr.press/v38/peharz15.htmlFeature Selection for Linear SVM with Provable GuaranteesWe give two provably accurate feature-selection techniques for the linear SVM. The algorithms run in deterministic and randomized time respectively. Our algorithms can be used in an unsupervised or supervised setting. The supervised approach is based on sampling features from support vectors. We prove that the margin in the feature space is preserved to within ε-relative error of the margin in the full feature space in the worst-case. In the unsupervised setting, we also provide worst-case guarantees of the radius of the minimum enclosing ball, thereby ensuring comparable generalization as in the full feature space and resolving an open problem posed in Dasgupta et al. We present extensive experiments on real-world datasets to support our theory and to demonstrate that our methods are competitive and often better than prior state-of-the-art, for which there are no known provable guarantees.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/paul15.html
https://proceedings.mlr.press/v38/paul15.htmlReactive bandits with attitudeWe consider a general class of K-armed bandits that adapt to the actions of the player. A single continuous parameter characterizes the “attitude” of the bandit, ranging from stochastic to cooperative or to fully adversarial in nature. The player seeks to maximize the expected return from the adaptive bandit, and the associated optimization problem is related to the free energy of a statistical mechanical system under an external field. When the underlying stochastic distribution is Gaussian, we derive an analytic solution for the long run optimal player strategy for different regimes of the bandit. In the fully adversarial limit, this solution is equivalent to the Nash equilibrium of a two-player, zero-sum semi-infinite game. We show how optimal strategies can be learned from sequential draws and reward observations in these adaptive bandits using Bayesian filtering and Thompson sampling. Results show the qualitative difference in policy pseudo-regret between our proposed strategy and other well-known bandit algorithms.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/ortega15.html
https://proceedings.mlr.press/v38/ortega15.htmlFast Function to Function RegressionWe analyze the problem of regression when both input covariates and output responses are functions from a nonparametric function class. Function to function regression (FFR) covers a large range of interesting applications including time-series prediction problems, and also more general tasks like studying a mapping between two separate types of distributions. However, previous nonparametric estimators for FFR type problems scale badly computationally with the number of input/output pairs in a data-set. Given the complexity of a mapping between general functions it may be necessary to consider large data-sets in order to achieve a low estimation risk. To address this issue, we develop a novel scalable nonparametric estimator, the Triple-Basis Estimator (3BE), which is capable of operating over datasets with many instances. To the best of our knowledge, the 3BE is the first nonparametric FFR estimator that can scale to massive data-sets. We analyze the 3BE’s risk and derive an upperbound rate. Furthermore, we show an improvement of several orders of magnitude in terms of prediction speed and a reduction in error over previous estimators in various real-world data-sets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/oliva15.html
https://proceedings.mlr.press/v38/oliva15.htmlVariance Reduction via Antithetic Markov ChainsWe present a Monte Carlo integration method, antithetic Markov chain sampling (AMCS), that incorporates local Markov transitions in an underlying importance sampler. Like sequential Monte Carlo sampling, the proposed method uses a sequence of Markov transitions to adapt the sampling to favour more influential regions of the integrand (modes). However, AMCS differs in the type of transitions that may be used, the number of Markov chains, and the method of chain termination. In particular, from each point sampled from an initial proposal, AMCS collects a sequence of points by simulating two independent, but antithetic, Markov chains, each terminated by a sample-dependent stopping rule. This approach provides greater flexibility for targeting influential areas while eliminating the need to fix the length of the Markov chain a priori. We show that the resulting estimator is unbiased and can reduce variance on peaked multi-modal integrands that challenge existing methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/neufeld15.html
https://proceedings.mlr.press/v38/neufeld15.htmlEfficient Training of Structured SVMs via Soft ConstraintsStructured output prediction is a powerful framework for jointly predicting interdependent output labels. Learning the parameters of structured predictors is a central task in machine learning applications. However, training the model from data often becomes computationally expensive. Several methods have been proposed to exploit the model structure, or decomposition, in order to obtain efficient training algorithms. In particular, methods based on linear programming relaxation, or dual decomposition, decompose the prediction task into multiple simpler prediction tasks and enforce agreement between overlapping predictions. In this work we observe that relaxing these agreement constraints and replacing them with soft constraints yields a much easier optimization problem. Based on this insight we propose an alternative training objective, analyze its theoretical properties, and derive an algorithm for its optimization. Our method, based on the Frank-Wolfe algorithm, achieves significant speedups over existing state-of-the-art methods without hurting prediction accuracy.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/meshi15.html
https://proceedings.mlr.press/v38/meshi15.htmlA Spectral Algorithm for Inference in Hidden semi-Markov ModelsHidden semi-Markov models (HSMMs) are latent variable models which allow latent state persistence and can be viewed as a generalization of the popular hidden Markov models (HMMs). In this paper, we introduce a novel spectral algorithm to perform inference in HSMMs. Our approach is based on estimating certain sample moments, whose order depends only logarithmically on the maximum length of the hidden state persistence. Moreover, the algorithm requires only a few spectral decompositions and is therefore computationally efficient. Empirical evaluations on synthetic and real data demonstrate the promise of the algorithm.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/melnyk15.html
https://proceedings.mlr.press/v38/melnyk15.htmlThe Security of Latent Dirichlet AllocationLatent Dirichlet allocation (LDA) is an increasingly popular tool for data analysis in many domains. If LDA output affects decision making (especially when money is involved), there is an incentive for attackers to compromise it. We ask the question: how can an attacker minimally poison the corpus so that LDA produces topics that the attacker wants the LDA user to see? Answering this question is important to characterize such attacks, and to develop defenses in the future. We give a novel bilevel optimization formulation to identify the optimal poisoning attack. We present an efficient solution (up to local optima) using descent method and implicit functions. We demonstrate poisoning attacks on LDA with extensive experiments, and discuss possible defenses.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/mei15.html
https://proceedings.mlr.press/v38/mei15.htmlActive Pointillistic Pattern SearchWe introduce the problem of active pointillistic pattern search (APPS), which seeks to discover regions of a domain exhibiting desired behavior with limited observations. Unusually, the patterns we consider are defined by large-scale properties of an underlying function that we can only observe at a limited number of points. Given a description of the desired patterns (in the form of a classifier taking functional inputs), we sequentially decide where to query function values to identify as many regions matching the pattern as possible, with high confience. For one broad class of models the expected reward of each unobserved point can be computed analytically. We demonstrate the proposed algorithm on three difficult search problems: locating polluted regions in a lake via mobile sensors, forecasting winning electoral districts with minimal polling, and identifying vortices in a fluid flow simulation.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/ma15.html
https://proceedings.mlr.press/v38/ma15.htmlTradeoffs for Space, Time, Data and Risk in Unsupervised LearningFaced with massive data, is it possible to trade off (statistical) risk, and (computational) space and time? This challenge lies at the heart of large-scale machine learning. Using k-means clustering as a prototypical unsupervised learning problem, we show how we can strategically summarize the data (control space) in order to trade off risk and time when data is generated by a probabilistic model. Our summarization is based on coreset constructions from computational geometry. We also develop an algorithm, TRAM, to navigate the space/time/data/risk tradeoff in practice. In particular, we show that for a fixed risk (or data size), as the data size increases (resp. risk increases) the running time of TRAM decreases. Our extensive experiments on real data sets demonstrate the existence and practical utility of such tradeoffs, not only for k-means but also for Gaussian Mixture Models.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lucic15.html
https://proceedings.mlr.press/v38/lucic15.htmlSimilarity Learning for High-Dimensional Sparse DataA good measure of similarity between data points is crucial to many tasks in machine learning. Similarity and metric learning methods learn such measures automatically from data, but they do not scale well respect to the dimensionality of the data. In this paper, we propose a method that can learn efficiently similarity measure from high-dimensional sparse data. The core idea is to parameterize the similarity measure as a convex combination of rank-one matrices with specific sparsity structures. The parameters are then optimized with an approximate Frank-Wolfe procedure to maximally satisfy relative similarity constraints on the training data. Our algorithm greedily incorporates one pair of features at a time into the similarity measure, providing an efficient way to control the number of active features and thus reduce overfitting. It enjoys very appealing convergence guarantees and its time and memory complexity depends on the sparsity of the data instead of the dimension of the feature space. Our experiments on real-world high-dimensional datasets demonstrate its potential for classification, dimensionality reduction and data exploration.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/liu15.html
https://proceedings.mlr.press/v38/liu15.htmlSparsistency of \ell_1-Regularized M-EstimatorsWe consider the model selection consistency or sparsistency of a broad set of \ell_1-regularized M-estimators for linear and non-linear statistical models in a unified fashion. For this purpose, we propose the local structured smoothness condition (LSSC) on the loss function. We provide a general result giving deterministic sufficient conditions for sparsistency in terms of the regularization parameter, ambient dimension, sparsity level, and number of measurements. We show that several important statistical models have M-estimators that indeed satisfy the LSSC, and as a result, the sparsistency guarantees for the corresponding \ell_1-regularized M-estimators can be derived as simple applications of our main theorem.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15f.html
https://proceedings.mlr.press/v38/li15f.htmlConditional Restricted Boltzmann Machines for Multi-label Learning with Incomplete LabelsStandard multi-label learning methods assume fully labeled training data. This assumption however is impractical in many application domains where labels are difficult to collect and missing labels are prevalent. In this paper, we develop a novel conditional restricted Boltzmann machine model to address multi-label learning with incomplete labels. It uses a restricted Boltzmann machine to capture the high-order label dependence relationships in the output space, aiming to enhance the capacity of recovering missing labels and learning high quality multi-label prediction models. Moreover, it also incorporates label co-occurrence information retrieved from auxiliary resources as prior knowledge. We perform model training by maximizing the regularized marginal conditional likelihood of the label vectors given the input features, and develop a Viterbi style EM algorithm to solve the induced optimization problem. The proposed approach is evaluated on four real word multi-label data sets by comparing to a number of state-of-the-art methods. The experimental results show it outperforms all the other comparison methods across the applied data sets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15e.html
https://proceedings.mlr.press/v38/li15e.htmlMax-Margin Zero-Shot Learning for Multi-class ClassificationDue to the dramatic expanse of data categories and the lack of labeled instances, zero-shot learning, which transfers knowledge from observed classes to recognize unseen classes, has started drawing a lot of attention from the research community. In this paper, we propose a semi-supervised max-margin learning framework that integrates the semi-supervised classification problem over observed classes and the unsupervised clustering problem over unseen classes together to tackle zero-shot multi-class classification. By further integrating label embedding into this framework, we produce a dual formulation that permits convenient incorporation of auxiliary label semantic knowledge to improve zero-shot learning. We conduct extensive experiments on three standard image data sets to evaluate the proposed approach by comparing to two state-of-the-art methods. Our results demonstrate the efficacy of the proposed framework.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15d.html
https://proceedings.mlr.press/v38/li15d.htmlCompressed Sensing with Very Sparse Gaussian Random Projections\noindent We study the use of \em very sparse random projections \citeProc:Li_Hastie_Church_KDD06,Proc:Li_KDD07 for compressed sensing (sparse signal recovery) when the nonzero coordinates of signals can be either positive or negative. In our setting, the entries of a Gaussian design matrix are randomly sparsified so that only a very small fraction of entries are nonzero. Our proposed decoding algorithm is simple and efficient in that the major cost is one linear scan of the coordinates. Using our proposed \textbf\em “tie estimator”, we are able to recover a K-sparse signal of length N using 1.551 eK \log K/δmeasurements (where δ≤0.05 is the confidence) in one scan. The practical performance of our method, however, can be substantially better than this bound. The Gaussian design assumption is not essential although it simplifies the analysis. \noindent Prior studies have shown that existing one-scan (or roughly one-scan) recovery algorithms using sparse matrices would require substantially (e.g., one order of magnitude) more measurements than L1 decoding by linear programming, when the nonzero coordinates of signals can be either negative or positive. In this paper, following a well-known experimental setup \citeUrl:wiki_sparse, we show that, at the same number of measurements, the recovery accuracies of our proposed method are similar to the standard L1 decoding.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15c.html
https://proceedings.mlr.press/v38/li15c.htmlToward Minimax Off-policy Value EstimationThis paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. It is shown that in a large class of settings the so-called regression estimator is minimax optimal up to a constant that depends on the number of actions, while the other two can be arbitrarily worse even in the limit of infinitely many data points, despite their empirical success and popularity. The performance of these estimators are studied in synthetic and real problems; illustrating the nontriviality of this simple task. Finally the results are extended to the problem of off-policy evaluation in contextual bandits and fixed-horizon Markov decision processes.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15b.html
https://proceedings.mlr.press/v38/li15b.htmlScalable Optimization of Randomized Operational Decisions in Adversarial Classification SettingsWhen learning, such as classification, is used in adversarial settings, such as intrusion detection, intelligent adversaries will attempt to evade the resulting policies. The literature on adversarial machine learning aims to develop learning algorithms which are robust to such adversarial evasion, but exhibits two significant limitations: a) failure to account for operational constraints and b) a restriction that decisions are deterministic. To overcome these limitations, we introduce a conceptual separation between learning, used to infer attacker preferences, and operational decisions, which account for adversarial evasion, enforce operational constraints, and naturally admit randomization. Our approach gives rise to an intractably large linear program. To overcome scalability limitations, we introduce a novel method for estimating a compact parity basis representation for the operational decision function. Additionally, we develop an iterative constraint generation approach which embeds adversary’s best response calculation, to arrive at a scalable algorithm for computing near-optimal randomized operational decisions. Extensive experiments demonstrate the efficacy of our approach.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/li15a.html
https://proceedings.mlr.press/v38/li15a.htmlModelling Policies in MDPs in Reproducing Kernel Hilbert SpaceWe consider modelling policies for MDPs in (vector-valued) reproducing kernel Hilbert function spaces (RKHS). This enables us to work “non-parametrically” in a rich function class, and provides the ability to learn complex policies. We present a framework for performing gradient-based policy optimization in the RKHS, deriving the functional gradient of the return for our policy, which has a simple form and can be estimated efficiently. The policy representation naturally focuses on the relevant region of state space defined by the policy trajectories, and does not rely on a-priori defined basis points; this can be an advantage in high dimensions where suitable basis points may be difficult to define a-priori. The method is adaptive in the sense that the policy representation will naturally adapt to the complexity of the policy being modelled, which is achieved with standard efficient sparsification tools in an RKHS. We argue that finding a good kernel on states can be easier then remetrizing a high dimensional feature space. We demonstrate the approach on benchmark domains and a simulated quadrocopter navigation task.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lever15.html
https://proceedings.mlr.press/v38/lever15.htmlBayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and ReducibilityBayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome and limit the scalability. In this paper we relax BHC into a non-probabilistic formulation, exploring smallvariance asymptotics in conjugate-exponential models. We develop a novel clustering algorithm, referred to as relaxed BHC (RBHC), from the asymptotic limit of the BHC model that exhibits the scalability of distance-based agglomerative clustering algorithms as well as the flexibility of Bayesian nonparametric models. We also investigate the reducibility of the dissimilarity measure emerged from the asymptotic limit of the BHC model, allowing us to use scalable algorithms such as the nearest neighbor chain algorithm. Numerical experiments on both synthetic and real-world datasets demonstrate the validity and high performance of our method.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lee15c.html
https://proceedings.mlr.press/v38/lee15c.htmlPreferential Attachment in Graphs with AffinitiesPreferential attachment models for random graphs are successful in capturing many characteristics of real networks such as power law behavior. At the same time they lack flexibility to take vertex to vertex affinities into account, a feature that is commonly used in many link recommendation algorithms. We propose a random graph model based on both node attributes and preferential attachment. This approach overcomes the limitation of existing models on expressing vertex affinity and on reflecting properties of different subgraphs. We analytically prove that our model preserves the power law behavior in the degree distribution as expressed by natural graphs and we show that it satisfies the small world property. Experiments show that our model provides an excellent fit of many natural graph statistics and we provide an algorithm to infer the associated affinity function efficiently.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lee15b.html
https://proceedings.mlr.press/v38/lee15b.htmlDeeply-Supervised NetsWe propose deeply-supervised nets (DSN), a method that simultaneously minimizes classification error and improves the directness and transparency of the hidden layer learning process. We focus our attention on three aspects of traditional convolutional-neural-network-type (CNN-type) architectures: (1) transparency in the effect intermediate layers have on overall classification; (2) discriminativeness and robustness of learned features, especially in early layers; (3) training effectiveness in the face of “vanishing” gradients. To combat these issues, we introduce “companion” objective functions at each hidden layer, in addition to the overall objective function at the output layer (an integrated strategy distinct from layer-wise pre-training). We also analyze our algorithm using techniques extended from stochastic gradient methods. The advantages provided by our method are evident in our experimental results, showing state-of-the-art performance on MNIST, CIFAR-10, CIFAR-100, and SVHN.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lee15a.html
https://proceedings.mlr.press/v38/lee15a.htmlParticle Gibbs for Bayesian Additive Regression TreesAdditive regression trees are flexible non-parametric models and popular off-the-shelf tools for real-world non-linear regression. In application domains, such as bioinformatics, where there is also demand for probabilistic predictions with measures of uncertainty, the Bayesian additive regression trees (BART) model, introduced by Chipman et al. (2010), is increasingly popular. As data sets have grown in size, however, the standard Metropolis–Hastings algorithms used to per- form inference in BART are proving inadequate. In particular, these Markov chains make local changes to the trees and suffer from slow mixing when the data are high- dimensional or the best-fitting trees are more than a few layers deep. We present a novel sampler for BART based on the Particle Gibbs (PG) algorithm (Andrieu et al., 2010) and a top-down particle filtering algorithm for Bayesian decision trees (Lakshminarayanan et al., 2013). Rather than making local changes to individual trees, the PG sampler proposes a complete tree to fit the residual. Experiments show that the PG sampler outperforms existing samplers in many settings.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lakshminarayanan15.html
https://proceedings.mlr.press/v38/lakshminarayanan15.htmlSequential Kernel Herding: Frank-Wolfe Optimization for Particle FilteringRecently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and “kernel herding” was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indicate indeed an improvement of accuracy over random and quasi-Monte Carlo sampling.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/lacoste-julien15.html
https://proceedings.mlr.press/v38/lacoste-julien15.htmlTight Regret Bounds for Stochastic Combinatorial Semi-BanditsA stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving the problem, which is known to be computationally efficient; and prove O(K L (1 / ∆) \log n) and O(\sqrtK L n \log n) upper bounds on its n-step regret, where L is the number of ground items, K is the maximum number of chosen items, and ∆is the gap between the expected returns of the optimal and best suboptimal solutions. The gap-dependent bound is tight up to a constant factor and the gap-free bound is tight up to a polylogarithmic factor.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kveton15.html
https://proceedings.mlr.press/v38/kveton15.htmlSymmetric Iterative Proportional FittingIterative Proportional Fitting (IPF) generates from an input matrix W a sequence of matrices that converges, under certain conditions, to a specific limit matrix W*. This limit is the relative-entropy nearest solution to W among all matrices of prescribed row marginals r and column marginals c. We prove this known fact by a novel strategy that contributes a pure algorithmic intuition. Then we focus on the symmetric setting: W=W’ and r=c. Since IPF inherently generates non-symmetric matrices, we introduce two symmetrized variants of IPF. We prove convergence for both of them. Further, we give a novel characterization for the existence of W* in terms of expansion properties of the undirected weighted graph represented by W. Finally, we show how our results contribute to recent work in machine learning.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kurras15.html
https://proceedings.mlr.press/v38/kurras15.htmlLow-Rank Spectral Learning with Weighted Loss FunctionsKulesza et al. recently observed that low-rank spectral learning algorithms, which discard the smallest singular values of a moment matrix during training, can behave in unexpected ways, producing large errors even when the discarded singular values are arbitrarily small. In this paper we prove that when learning predictive state representations those problematic cases disappear if we introduce a particular weighted loss function and learn using sufficiently large sets of statistics; our main result is a bound on the loss of the learned low-rank model in terms of the singular values that are discarded. Practically speaking, this suggests that regardless of the model rank we should use the largest possible sets of statistics, and we show empirically that this is true on both synthetic and real-world domains.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kulesza15.html
https://proceedings.mlr.press/v38/kulesza15.htmlTensor Factorization via Matrix FactorizationTensor factorization arises in many machine learning applications, such as knowledge base modeling and parameter estimation in latent variable models. However, numerical methods for tensor factorization have not reached the level of maturity of matrix factorization methods. In this paper, we propose a new algorithm for CP tensor factorization that uses random projections to reduce the problem to simultaneous matrix diagonalization. Our method is conceptually simple and also applies to non-orthogonal and asymmetric tensors of arbitrary order. We prove that a small number random projections essentially preserves the spectral information in the tensor, allowing us to remove the dependence on the eigengap that plagued earlier tensor-to-matrix reductions. Experimentally, our method outperforms existing tensor factorization methods on both simulated data and two real datasets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kuleshov15.html
https://proceedings.mlr.press/v38/kuleshov15.htmlOn Estimating L_2^2 DivergenceWe give a comprehensive theoretical characterization of a nonparametric estimator for the L_2^2 divergence between two continuous distributions. We first bound the rate of convergence of our estimator, showing that it is \sqrtn-consistent provided the densities are sufficiently smooth. In this smooth regime, we then show that our estimator is asymptotically normal, construct asymptotic confidence intervals, and establish a Berry-Esséen style inequality characterizing the rate of convergence to normality. We also show that this estimator is minimax optimal.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/krishnamurthy15.html
https://proceedings.mlr.press/v38/krishnamurthy15.htmlDART: Dropouts meet Multiple Additive Regression TreesMART, an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and is widely used in practice. However, it suffers an issue which we call over-specialization, wherein trees added at later iterations tend to impact the prediction of only a few instances, and make negligible contribution towards the remaining instances. This negatively affects the performance of the model on unseen data, and also makes the model over-sensitive to the contributions of the few, initially added tress. We show that the commonly used tool to address this issue, that of shrinkage, alleviates the problem only to a certain extent and the fundamental issue of over-specialization still remains. In this work, we explore a different approach to address the problem, that of employing dropouts, a tool that has been recently proposed in the context of learning deep neural networks. We propose a novel way of employing dropouts to tackle the issue of over-specialization in MART, resulting in the DART algorithm. We evaluate DART on ranking, regression and classification tasks, using large scale, publicly available datasets, and show that DART outperforms MART in each of the tasks, with a significant margin.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/korlakaivinayak15.html
https://proceedings.mlr.press/v38/korlakaivinayak15.htmlA Bayes consistent 1-NN classifierWe show that a simple modification of the 1-nearest neighbor classifier yields a strongly Bayes consistent learner. Prior to this work, the only strongly Bayes consistent proximity-based method was the k-nearest neighbor classifier, for k growing appropriately with sample size. We will argue that a margin-regularized 1-NN enjoys considerable statistical and algorithmic advantages over the k-NN classifier. These include user-friendly finite-sample error bounds, as well as time- and memory-efficient learning and test-point evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging empirical results are reported.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kontorovich15.html
https://proceedings.mlr.press/v38/kontorovich15.htmlDimensionality estimation without distancesWhile existing methods for estimating the intrinsic dimension of datasets require to know distances between data points, we consider a situation where one has considerably less information. Given a sample of points, all we get to see is who are the k nearest neighbors of every point. In other words, we get the adjacency matrix of the directed, unweighted k-nearest neighbor graph on the sample, but do not know any point coordinates or distances between the points. We provide two estimators for this situation, a naive one and a more elaborate one. Both of them can be proved to be statistically consistent. However, further theoretical and experimental evidence shows that the naive estimator performs rather poorly, whereas the elaborate one achieves results comparable to those of estimators based on distance information.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kleindessner15.html
https://proceedings.mlr.press/v38/kleindessner15.htmlLatent feature regression for multivariate count dataWe consider the problem of regression on multivariate count data and present a Gibbs sampler for a latent feature regression model suitable for both under- and overdispersed response variables. The model learns count-valued latent features conditional on arbitrary covariates, modeling them as negative binomial variables, and maps them into the dependent count-valued observations using a Dirichlet-multinomial distribution. From another viewpoint, the model can be seen as a generalization of a specific topic model for scenarios where we are interested in generating the actual counts of observations and not just their relative frequencies and co-occurrences. The model is demonstrated on a smart traffic application where the task is to predict public transportation volume for unknown locations based on a characterization of the close-by services and venues.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/klami15.html
https://proceedings.mlr.press/v38/klami15.htmlSparse Submodular Probabilistic PCAWe propose a novel approach for sparse probabilistic principal component analysis, that combines a low rank representation for the latent factors and loadings with a novel sparse variational inference approach for estimating distributions of latent variables subject to sparse support constraints. Inference and parameter estimation for the resulting model is achieved via expectation maximization with a novel variational inference method for the E-step that induces sparsity. We show that this inference problem can be reduced to discrete optimal support selection. The discrete optimization is submodular, hence, greedy selection is guaranteed to achieve 1-1/e fraction of the optimal. Empirical studies indicate effectiveness of the proposed approach for the recovery of a parsimonious decomposition as compared to established baseline methods. We also evaluate our method against state-of-the-art methods on high dimensional fMRI data, and show that the method performs as good as or better than other methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/khanna15.html
https://proceedings.mlr.press/v38/khanna15.htmlOn Approximate Non-submodular Minimization via Tree-Structured SupermodularityWe address the problem of minimizing non-submodular functions where the supermodularity is restricted to tree-structured pairwise terms. We are motivated by several real world applications, which require submodularity along with structured supermodularity, and this forms a rich class of expressive models, where the non-submodularity is restricted to a tree. While this problem is NP hard (as we show), we develop several practical algorithms to find approximate and near-optimal solutions for this problem, some of which provide lower and others of which provide upper bounds thereby allowing us to compute a tightness gap. We also show that some of our algorithms can be extended to handle more general forms of supermodularity restricted to arbitrary pairwise terms. We compare our algorithms on synthetic data, and also demonstrate the advantage of the formulation on the real world application of image segmentation, where we incorporate structured supermodularity into higher-order submodular energy minimization.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/kawahara15.html
https://proceedings.mlr.press/v38/kawahara15.htmlRobust Cost Sensitive Support Vector MachineIn this paper we consider robust classifications and show equivalence between the regularized classifications. In general, robust classifications are used to create a classifier robust to data by taking into account the uncertainty of the data. Our result shows that regularized classifications inherit robustness and provide reason on why some regularized classifications tend to be robust against data. Although most robust classification problems assume that every uncertain data lie within an identical bounded set, this paper considers a generalized model where the sizes of the bounded sets are different for each data. These models can be transformed into regularized classification models where the penalties for each data are assigned according to their losses. We see that considering such models opens up for new applications. For an example, we show that this robust classification technique can be used for Imbalanced Data Learning. We conducted experimentation with actual data and compared it with other IDL algorithms such as Cost Sensitive SVMs. This is a novel usage for the robust classification scheme and encourages it to be a suitable candidate for imbalanced data learning.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/katsumata15.html
https://proceedings.mlr.press/v38/katsumata15.htmlConsensus Message Passing for Layered Graphical ModelsGenerative models provide a powerful framework for probabilistic reasoning. However, in many domains their use has been hampered by the practical difficulties of inference. This is particularly the case in computer vision, where models of the imaging process tend to be large, loopy and layered. For this reason bottom-up conditional models have traditionally dominated in such domains. We find that widely-used, general-purpose message passing inference algorithms such as Expectation Propagation (EP) and Variational Message Passing (VMP) fail on the simplest of vision models. With these models in mind, we introduce a modification to message passing that learns to exploit their layered structure by passing ’consensus’ messages that guide inference towards good solutions. Experiments on a variety of problems show that the proposed technique leads to significantly more accurate inference results, not only when compared to standard EP and VMP, but also when compared to competitive bottom-up conditional models.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/jampani15.html
https://proceedings.mlr.press/v38/jampani15.htmlSparse Dueling BanditsThe dueling bandit problem is a variation of the classical multi-armed bandit in which the allowable actions are noisy comparisons between pairs of arms. This paper focuses on a new approach for finding the best arm according to the Borda criterion using noisy comparisons. We prove that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse gaps squared of the Borda scores of each arm. We explore this dependence further and consider structural constraints on the pairwise comparison matrix (a particular form of sparsity natural to this problem) that can significantly reduce the sample complexity. This motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms. We also evaluate the new algorithm experimentally with synthetic and real data. The results show that the sparsity model and the new algorithm can provide significant improvements over standard approaches.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/jamieson15.html
https://proceedings.mlr.press/v38/jamieson15.htmlEstimating the accuracies of multiple classifiers without labeled dataIn various situations one is given only the predictions of multiple classifiers over a large unlabeled test data. This scenario raises the following questions: Without any labeled data and without any a-priori knowledge about the reliability of these different classifiers, is it possible to consistently and computationally efficiently estimate their accuracies? Furthermore, also in a completely unsupervised manner, can one construct a more accurate unsupervised ensemble classifier? In this paper, focusing on the binary case, we present simple, computationally efficient algorithms to solve these questions. Furthermore, under standard classifier independence assumptions, we prove our methods are consistent and study their asymptotic error. Our approach is spectral, based on the fact that the off-diagonal entries of the classifiers’ covariance matrix and 3-d tensor are rank-one. We illustrate the competitive performance of our algorithms via extensive experiments on both artificial and real datasets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/jaffe15.html
https://proceedings.mlr.press/v38/jaffe15.htmlOnline Optimization : Competing with Dynamic ComparatorsRecent literature on online learning has focused on developing adaptive algorithms that take advantage of a regularity of the sequence of observations, yet retain worst-case performance guarantees. A complementary direction is to develop prediction methods that perform well against complex benchmarks. In this paper, we address these two directions together. We present a fully adaptive method that competes with dynamic benchmarks in which regret guarantee scales with regularity of the sequence of cost functions and comparators. Notably, the regret bound adapts to the smaller complexity measure in the problem environment. Finally, we apply our results to drifting zero-sum, two-player games where both players achieve no regret guarantees against best sequences of actions in hindsight.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/jadbabaie15.html
https://proceedings.mlr.press/v38/jadbabaie15.htmlSubmodular Point Processes with Applications to Machine learningWe introduce a class of discrete point processes that we call the \emphSubmodular Point Processes (SPPs). These processes are characterized via a submodular (or supermodular) function, and naturally model notions of \emphinformation, coverage and \emphdiversity, as well as \emphcooperation. Unlike Log-submodular and Log-supermodular distributions (Log-SPPs) such as determinantal point processes (DPPs), SPPs are themselves submodular (or supermodular). In this paper, we analyze the computational complexity of probabilistic inference in SPPs. We show that computing the partition function for SPPs (and Log-SPPs), requires exponential complexity in the worst case, and also provide algorithms which approximate SPPs up to polynomial factors. Moreover, for several subclasses of interesting submodular functions that occur in applications, we show how we can provide efficient closed form expressions for the partition functions, and thereby marginals and conditional distributions. We also show how SPPs are closed under mixtures, thus enabling maximum likelihood based strategies for learning mixtures of submodular functions. Finally, we argue how SPPs complement existing Log-SPP distributions, and are a natural model for several applications.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/iyer15.html
https://proceedings.mlr.press/v38/iyer15.htmlCross-domain recommendation without shared users or items by sharing latent vector distributionsWe propose a cross-domain recommendation method for predicting the ratings of items in different domains, where neither users nor items are shared across domains. The proposed method is based on matrix factorization, which learns a latent vector for each user and each item. Matrix factorization techniques for a single-domain fail in the cross-domain recommendation task because the learned latent vectors are not aligned over different domains. The proposed method assumes that latent vectors in different domains are generated from a common Gaussian distribution with a full covariance matrix. By inferring the mean and covariance of the common Gaussian from given cross-domain rating matrices, the latent factors are aligned, which enables us to predict ratings in different domains. Experiments conducted on rating datasets from a wide variety of domains, e.g., movie, books and electronics, demonstrate that the proposed method achieves higher performance for predicting cross-domain ratings than existing methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/iwata15.html
https://proceedings.mlr.press/v38/iwata15.htmlReliable and Scalable Variational Inference for the Hierarchical Dirichlet ProcessWe introduce a new variational inference objective for hierarchical Dirichlet process admixture models. Our approach provides novel and scalable algorithms for learning nonparametric topic models of text documents and Gaussian admixture models of image patches. Improving on the point estimates of topic probabilities used in previous work, we define full variational posteriors for all latent variables and optimize parameters via a novel surrogate likelihood bound. We show that this approach has crucial advantages for data-driven learning of the number of topics. Via merge and delete moves that remove redundant or irrelevant topics, we learn compact and interpretable models with less computation. Scaling to millions of documents is possible using stochastic or memoized variational updates.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hughes15.html
https://proceedings.mlr.press/v38/hughes15.htmlStochastic Structured Variational InferenceStochastic variational inference makes it possible to approximate posterior distributions induced by large datasets quickly using stochastic optimization. The algorithm relies on the use of fully factorized variational distributions. However, this “mean-field” independence approximation limits the fidelity of the posterior approximation, and introduces local optima. We show how to relax the mean-field approximation to allow arbitrary dependencies between global parameters and local hidden variables, producing better parameter estimates by reducing bias, sensitivity to local optima, and sensitivity to hyperparameters.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hoffman15.html
https://proceedings.mlr.press/v38/hoffman15.htmlScalable Variational Gaussian Process ClassificationGaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation an be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hensman15.html
https://proceedings.mlr.press/v38/hensman15.htmlMetric recovery from directed unweighted graphsWe analyze directed, unweighted graphs obtained from x_i∈\RR^d by connecting vertex i to j iff |x_i - x_j| < ε(x_i). Examples of such graphs include k-nearest neighbor graphs, where ε(x_i) varies from point to point, and, arguably, many real world graphs such as co-purchasing graphs. We ask whether we can recover the underlying Euclidean metric ε(x_i) and the associated density p(x_i) given only the directed graph and d. We show that consistent recovery is possible up to isometric scaling when the vertex degree is at least ω(n^2/(2+d)\log(n)^d/(d+2)). Our estimator is based on a careful characterization of a random walk over the directed graph and the associated continuum limit. As an algorithm, it resembles the PageRank centrality metric. We demonstrate empirically that the estimator performs well on simulated examples as well as on real-world co-purchasing graphs even with a small number of points and degree scaling as low as \log(n).Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hashimoto15.html
https://proceedings.mlr.press/v38/hashimoto15.htmlA Consistent Method for Graph Based Anomaly LocalizationThe anomaly localization task aims at detecting faulty sensors automatically by monitoring the sensor values. In this paper, we propose an anomaly localization algorithm with a consistency guarantee on its results. Although several algorithms were proposed in the last decade, the consistency of the localization results was not discussed in the literature. To the best of our knowledge, this is the first study that provides theoretical guarantees for the localization results. Our new approach is to formulate the task as solving the sparsest subgraph problem on a difference graph. Since this problem is NP-hard, we then use a convex quadratic programming approximation algorithm, which is guaranteed to be consistent under suitable conditions. Across the simulations on both synthetic and real world datasets, we verify that the proposed method achieves higher anomaly localization performance compared to existing methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hara15.html
https://proceedings.mlr.press/v38/hara15.htmlPreserving Privacy of Continuous High-dimensional Data with Minimax FiltersPreserving privacy of high-dimensional and continuous data such as images or biometric data is a challenging problem. This paper formulates this problem as a learning game between three parties: 1) data contributors using a filter to sanitize data samples, 2) a cooperative data aggregator learning a target task using the filtered samples, and 3) an adversary learning to identify contributors using the same filtered samples. Minimax filters that achieve the optimal privacy-utility trade-off from broad families of filters and loss/classifiers are defined, and algorithms for learning the filers in batch or distributed settings are presented. Experiments with several real-world tasks including facial expression recognition, speech emotion recognition, and activity recognition from motion, show that the minimax filter can simultaneously achieve similar or better target task accuracy and lower privacy risk, often significantly lower than previous methods.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/hamm15.html
https://proceedings.mlr.press/v38/hamm15.htmlThe Bayesian Echo Chamber: Modeling Social Influence via Linguistic AccommodationWe present the Bayesian Echo Chamber, a new Bayesian generative model for social interaction data. By modeling the evolution of people’s language usage over time, this model discovers latent influence relationships between them. Unlike previous work on inferring influence, which has primarily focused on simple temporal dynamics evidenced via turn-taking behavior, our model captures more nuanced influence relationships, evidenced via linguistic accommodation patterns in interaction content. The model, which is based on a discrete analog of the multivariate Hawkes process, permits a fully Bayesian inference algorithm. We validate our model’s ability to discover latent influence patterns using transcripts of arguments heard by the US Supreme Court and the movie "12 Angry Men." We showcase our model’s capabilities by using it to infer latent influence patterns from Federal Open Market Committee meeting transcripts, demonstrating state-of-the-art performance at uncovering social dynamics in group discussions.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/guo15.html
https://proceedings.mlr.press/v38/guo15.htmlConsistent Collective Matrix Completion under Joint Low Rank StructureWe address the collective matrix completion problem of jointly recovering a collection of matrices with shared structure from partial (and potentially noisy) observations. To ensure well–posedness of the problem, we impose a joint low rank structure, wherein each component matrix is low rank and the latent space of the low rank factors corresponding to each entity is shared across the entire collection. We first develop a rigorous algebra for representing and manipulating collective–matrix structure, and identify sufficient conditions for consistent estimation of collective matrices. We then propose a tractable convex estimator for solving the collective matrix completion problem, and provide the first non–trivial theoretical guarantees for consistency of collective matrix completion. We show that under reasonable assumptions stated in Sec. 3.1, with high probability, the proposed estimator exactly recovers the true matrices whenever sample complexity requirements dictated by Theorem 1 are met. The sample complexity requirement derived in the paper are optimum up to logarithmic factors, and significantly improve upon the requirements obtained by trivial extensions of standard matrix completion. Finally, we propose a scalable approximate algorithm to solve the proposed convex program, and corroborate our results through simulated and real life experiments.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/gunasekar15.html
https://proceedings.mlr.press/v38/gunasekar15.htmlModeling Skill Acquisition Over Time with Sequence and Topic ModelingOnline education provides data from students solving problems at different levels of proficiency over time. Unfortunately, methods that use these data for inferring student knowledge rely on costly domain expertise. We propose three novel data-driven methods that bridge sequence modeling with topic models to infer students’ time varying knowledge. These methods differ in complexity, interpretability, accuracy and human supervision. For example, our most interpretable method has similar classification accuracy to the models created by domain experts, but requires much less effort. On the other hand, the most accurate method is completely data-driven and improves predictions by up to 15% in AUC, an evaluation metric for classifiers.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/gonzalez-brenes15.html
https://proceedings.mlr.press/v38/gonzalez-brenes15.htmlOn Anomaly Ranking and Excess-Mass CurvesLearning how to rank multivariate unlabeled observations depending on their degree of abnormality/novelty is a crucial problem in a wide range of applications. In practice, it generally consists in building a real valued "scoring" function on the feature space so as to quantify to which extent observations should be considered as abnormal. In the 1-d situation, measurements are generally considered as ”abnormal” when they are remote from central measures such as the mean or the median. Anomaly detection then relies on tail analysis of the variable of interest. Extensions to the multivariate setting are far from straightforward and it is precisely the main purpose of this paper to introduce a novel and convenient (functional) criterion for measuring the performance of a scoring function regarding the anomaly ranking task, referred to as the Excess-Mass curve (EM-curve). In addition, an adaptive algorithm for building a scoring function based on unlabeled data with a nearly optimal EM is proposed and is analyzed from a statistical perspective.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/goix15.html
https://proceedings.mlr.press/v38/goix15.htmlEfficient Estimation of Mutual Information for Strongly Dependent VariablesWe demonstrate that a popular class of non-parametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/gao15.html
https://proceedings.mlr.press/v38/gao15.htmlLearning Deep Sigmoid Belief Networks with Data AugmentationDeep directed generative models are developed. The multi-layered model is designed by stacking sigmoid belief networks, with sparsity-encouraging priors placed on the model parameters. Learning and inference of layer-wise model parameters are implemented in a Bayesian setting. By exploring the idea of data augmentation and introducing auxiliary Polya-Gamma variables, simple and efficient Gibbs sampling and mean-field variational Bayes (VB) inference are implemented. To address large-scale datasets, an online version of VB is also developed. Experimental results are presented for three publicly available datasets: MNIST, Caltech 101 Silhouettes and OCR letters.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/gan15.html
https://proceedings.mlr.press/v38/gan15.htmlComputational Complexity of Linear Large Margin Classification With Ramp LossMinimizing the binary classification error with a linear model leads to an NP-hard problem. In practice, surrogate loss functions are used, in particular loss functions leading to large margin classification such as the hinge loss and the ramp loss. The intuitive large margin concept is theoretically supported by generalization bounds linking the expected classification error to the empirical margin error and the complexity of the considered hypotheses class. This article addresses the fundamental question about the computational complexity of determining whether there is a hypotheses class with a hypothesis such that the upper bound on the generalization error is below a certain value. Results of this type are important for model comparison and selection. This paper takes a first step and proves that minimizing a basic margin-bound is NP-hard when considering linear hypotheses and the rho-margin loss function, which generalizes the ramp loss. This result directly implies the hardness of ramp loss minimization.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/frejstrupmaibing15.html
https://proceedings.mlr.press/v38/frejstrupmaibing15.htmlA Sufficient Statistics Construction of Exponential Family Lévy Measure Densities for Nonparametric Conjugate ModelsConjugate pairs of distributions over infinite dimensional spaces are prominent in machine learning, particularly due to the widespread adoption of Bayesian nonparametric methodologies for a host of models and applications. Much of the existing literature in the learning community focuses on processes possessing some form of computationally tractable conjugacy as is the case for the beta process and the gamma process (and, via normalization, the Dirichlet process). For these processes, conjugacy is proved via statistical machinery tailored to the particular model. We seek to address the problem of obtaining a general construction of prior distributions over infinite dimensional spaces possessing distributional properties amenable to conjugacy. Our result is achieved by generalizing Hjort’s construction of the beta process via appropriate utilization of sufficient statistics for exponential families.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/finn15.html
https://proceedings.mlr.press/v38/finn15.htmlGraph Approximation and Clustering on a BudgetWe consider the problem of learning from a similarity matrix (such as spectral clustering and low-dimensional embedding), when computing pairwise similarities are costly, and only a limited number of entries can be observed. We provide a theoretical analysis using standard notions of graph approximation, significantly generalizing previous results, which focused on spectral clustering with two clusters. We also propose a new algorithmic approach based on adaptive sampling, which experimentally matches or improves on previous methods, while being considerably more general and computationally cheaper.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/fetaya15.html
https://proceedings.mlr.press/v38/fetaya15.htmlBack to the Past: Source Identification in Diffusion Networks from Partially Observed CascadesWhen a piece of malicious information becomes rampant in an information diffusion network, can we identify the source node that originally introduced the piece into the network and infer the time when it initiated this? Being able to do so is critical for curtailing the spread of malicious information, and reducing the potential losses incurred. This is a very challenging problem since typically only incomplete traces are observed and we need to unroll the incomplete traces into the past in order to pinpoint the source. In this paper, we tackle this problem by developing a two-stage framework, which first learns a continuous-time diffusion network based on historical diffusion traces and then identifies the source of an incomplete diffusion trace by maximizing the likelihood of the trace under the learned model. Experiments on both large synthetic and real-world data show that our framework can effectively go back to the past, and pinpoint the source node and its initiation time significantly more accurately than previous state-of-the-arts.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/farajtabar15.html
https://proceedings.mlr.press/v38/farajtabar15.htmlA totally unimodular view of structured sparsityThis paper describes a simple framework for structured sparse recovery based on convex optimization. We show that many structured sparsity models can be naturally represented by linear matrix inequalities on the support of the unknown parameters, where the constraint matrix has a totally unimodular (TU) structure. For such structured models, tight convex relaxations can be obtained in polynomial time via linear programming. Our modeling framework unifies the prevalent structured sparsity norms in the literature, introduces new interesting ones, and renders their tightness and tractability arguments transparent.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/elhalabi15.html
https://proceedings.mlr.press/v38/elhalabi15.htmlA Topic Modeling Approach to RankingWe propose a topic modeling approach to the prediction of preferences in pairwise comparisons. We develop a new generative model for pairwise comparisons that accounts for multiple shared latent rankings that are prevalent in a population of users. This new model also captures inconsistent user behavior in a natural way. We show how the estimation of latent rankings in the new generative model can be formally reduced to the estimation of topics in a statistically equivalent topic modeling problem. We leverage recent advances in the topic modeling literature to develop an algorithm that can learn shared latent rankings with provable consistency as well as sample and computational complexity guarantees. We demonstrate that the new approach is empirically competitive with the current state-of-the-art approaches in predicting preferences on some semi-synthetic and real world datasets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/ding15.html
https://proceedings.mlr.press/v38/ding15.htmlAveraged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling DistributionsWe consider the least-squares regression problem and provide a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent. In the strongly-convex case, we provide an asymptotic expansion up to explicit exponentially decaying terms. Our analysis leads to new insights into stochastic approximation algorithms: (a) it gives a tighter bound on the allowed step-size; (b) the generalization error may be divided into a variance term which is decaying as O(1/n), independently of the step-size g, and a bias term that decays as O(1/g^2 n^2); (c) when allowing non-uniform sampling of examples over a dataset, the choice of a good sampling density depends on the trade-off between bias and variance: when the variance term dominates, optimal sampling densities do not lead to much gain, while when the bias term dominates, we can choose larger step-sizes that lead to significant improvements.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/defossez15.html
https://proceedings.mlr.press/v38/defossez15.htmlThe Loss Surfaces of Multilayer NetworksWe study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/choromanska15.html
https://proceedings.mlr.press/v38/choromanska15.htmlModel Selection for Topic Models via Spectral DecompositionTopic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following the recent advances in topic models via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. With mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/cheng15.html
https://proceedings.mlr.press/v38/cheng15.htmlExact Bayesian Learning of Ancestor Relations in Bayesian NetworksAncestor relations in Bayesian networks (BNs) encode long-range causal relations among random variables. In this paper, we develop dynamic programming (DP) algorithms to compute the exact posterior probabilities of ancestor relations in Bayesian networks. Previous algorithm by Parviainen and Koivisto (2011) evaluates all possible ancestor relations in time O(n3^n) and space O(3^n). However, their algorithm assumes an order-modular prior over DAGs that does not respect Markov equivalence. The resulting posteriors would bias towards DAGs consistent with more linear orders. To adhere to the uniform prior, we develop a new DP algorithm that computes the exact posteriors of all possible ancestor relations in time O(n^25^n-1) and space O(3^n). We also discuss the extension of our algorithm to computing the posteriors of s >p >t relations, i.e., a directed path from s to t via p. We apply our algorithm to a biological data set for discovering protein signaling pathways.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chen15e.html
https://proceedings.mlr.press/v38/chen15e.htmlPredictive Inverse Optimal Control for Linear-Quadratic-Gaussian SystemsPredictive inverse optimal control is a powerful approach for estimating the control policy of an agent from observed control demonstrations. Its usefulness has been established in a number of large-scale sequential decision settings characterized by complete state observability. However, many real decisions are made in situations where the state is not fully known to the agent making decisions. Though extensions of predictive inverse optimal control to partially observable Markov decision processes have been developed, their applicability has been limited by the complexities of inference in those representations. In this work, we extend predictive inverse optimal control to the linear- quadratic-Gaussian control setting. We establish close connections between optimal control laws for this setting and the probabilistic predictions under our approach. We demonstrate the effectiveness and benefit in estimating control policies that are influenced by partial observability on both synthetic and real datasets.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chen15d.html
https://proceedings.mlr.press/v38/chen15d.htmlFiltered Search for Submodular Maximization with Controllable Approximation BoundsMost existing submodular maximization algorithms provide theoretical guarantees with approximation bounds. However, in many cases, users may be interested in an anytime algorithm that can offer a flexible trade-off between computation time and optimality guarantees. In this paper, we propose a filtered search (FS) framework that allows the user to set an arbitrary approximation bound guarantee with a “tunable knob”, from 0 (arbitrarily bad) to 1 (globally optimal). FS naturally handles monotone and non-monotone functions as well as unconstrained problems and problems with cardinality, matroid, and knapsack constraints. Further, it can also be applied to (non-negative) non-submodular functions and still gives controllable approximation bounds based on their submodularity ratio. Finally, FS encompasses the greedy algorithm as a special case. Our framework is based on theory in A* search, but is substantially more efficient because it only requires heuristics that are critically admissible (CA) rather than admissible—a condition that gives more effective pruning and is substantially easier to implement.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chen15c.html
https://proceedings.mlr.press/v38/chen15c.htmlEfficient Second-Order Gradient Boosting for Conditional Random FieldsConditional random fields (CRFs) are an important class of models for accurate structured prediction, but effective design of the feature functions is a major challenge when applying CRF models to real world data. Gradient boosting, which is used to automatically induce and select feature functions, is a natural candidate solution to the problem. However, it is non-trivial to derive gradient boosting algorithms for CRFs due to the dense Hessian matrices introduced by variable dependencies. Existing approaches thus use only first-order information when optimizing likelihood, and hence face convergence issues. We incorporate second-order information by deriving a Markov Chain mixing rate bound to quantify the dependencies, and introduce a gradient boosting algorithm that iteratively optimizes an adaptive upper bound of the objective function. The resulting algorithm induces and selects features for CRFs via functional space optimization, with provable convergence guarantees. Experimental results on three real world datasets demonstrate that the mixing rate based upper bound is effective for learning CRFs with non-linear potentials.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chen15b.html
https://proceedings.mlr.press/v38/chen15b.htmlOne-bit Compressed Sensing with the k-Support NormIn one-bit compressed sensing (1-bit CS), one attempts to estimate a structured parameter (signal) only using the sign of suitable linear measurements. In this paper, we investigate 1-bit CS problems for sparse signals using the recently proposed k-support norm. We show that the new estimator has a closed-form solution, so no optimization is needed. We establish consistency and recovery guarantees of the estimator for both Gaussian and sub-Gaussian random measurements. For Gaussian measurements, our estimator is comparable to the best known in the literature, along with guarantees on support recovery. For sub-Gaussian measurements, our estimator has an irreducible error which, unlike existing results, can be controlled by scaling the measurement vectors. In both cases, our analysis covers the setting of model misspecification, i.e., when the true sparsity is unknown. Experimental results illustrate several strengths of the new estimator.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chen15a.html
https://proceedings.mlr.press/v38/chen15a.htmlOnline Ranking with Top-1 FeedbackWe consider a setting where a system learns to rank a fixed set of m items. The goal is produce good item rankings for users with diverse interests who interact online with the system for T rounds. We consider a novel top-1 feedback model: at the end of each round, the relevance score for only the top ranked object is revealed. However, the performance of the system is judged on the entire ranked list. We provide a comprehensive set of results regarding learnability under this challenging setting. For PairwiseLoss and DCG, two popular ranking measures, we prove that the minimax regret is Θ(T^2/3). Moreover, the minimax regret is achievable using an efficient strategy that only spends O(m \log m) time per round. The same efficient strategy achieves O(T^2/3) regret for Precision@k. Surprisingly, we show that for normalized versions of these ranking measures, i.e., AUC, NDCG & MAP, no online ranking algorithm can have sublinear regret.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/chaudhuri15.html
https://proceedings.mlr.press/v38/chaudhuri15.htmlImplementable confidence sets in high dimensional regressionWe consider the setting of linear regression in high dimension. We focus on the problem of constructing adaptive and honest confidence sets for the sparse parameter θ, i.e. we want to construct a confidence set for theta that contains theta with high probability, and that is as small as possible. The l_2 diameter of a such confidence set should depend on the sparsity S of θ- the larger S, the wider the confidence set. However, in practice, S is unknown. This paper focuses on constructing a confidence set for θwhich contains θwith high probability, whose diameter is adaptive to the unknown sparsity S, and which is implementable in practice.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/carpentier15.html
https://proceedings.mlr.press/v38/carpentier15.htmlStochastic Spectral Descent for Restricted Boltzmann MachinesRestricted Boltzmann Machines (RBMs) are widely used as building blocks for deep learning models. Learning typically proceeds by using stochastic gradient descent, and the gradients are estimated with sampling methods. However, the gradient estimation is a computational bottleneck, so better use of the gradients will speed up the descent algorithm. To this end, we first derive upper bounds on the RBM cost function, then show that descent methods can have natural ad- vantages by operating in the L∞and Shatten-∞norm. We introduce a new method called “Stochastic Spectral Descent” that updates parameters in the normed space. Empirical results show dramatic improvements over stochastic gradient descent, and have only have a fractional increase on the per-iteration cost.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/carlson15.html
https://proceedings.mlr.press/v38/carlson15.htmlAccurate and conservative estimates of MRF log-likelihood using reverse annealingMarkov random fields (MRFs) are difficult to evaluate as generative models because computing the test log-probabilities requires the intractable partition function. Annealed importance sampling (AIS) is widely used to estimate MRF partition functions, and often yields quite accurate results. However, AIS is prone to overestimate the log-likelihood with little indication that anything is wrong. We present the Reverse AIS Estimator (RAISE), a stochastic lower bound on the log-likelihood of an approximation to the original MRF model. RAISE requires only the same MCMC transition operators as standard AIS. Experimental results indicate that RAISE agrees closely with AIS log-probability estimates for RBMs, DBMs, and DBNs, but typically errs on the side of underestimating, rather than overestimating, the log-likelihood.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/burda15.html
https://proceedings.mlr.press/v38/burda15.htmlGeneralized Linear Models for Aggregated DataDatabases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naive modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. We consider a limiting case of generalized linear modeling when the target variables are only known up to permutation, and explore how this relates to permutation testing; a standard technique for assessing statistical dependency. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins - with order statistics such as the median as a limiting case. Our experimental results on simulated data and aggregated healthcare data suggest a diminishing returns property with respect to the granularity of the histogram - when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/bhowmik15.html
https://proceedings.mlr.press/v38/bhowmik15.htmlSparse Solutions to Nonnegative Linear Systems and ApplicationsWe give an efficient algorithm for finding sparse approximate solutions to linear systems of equations with nonnegative coefficients. Unlike most known results for sparse recovery, we do not require \emphany assumption on the matrix other than non-negativity. Our algorithm is combinatorial in nature, inspired by techniques for the “set cover” problem, as well as the multiplicative weight update method. We then present a natural application to learning mixture models in the PAC framework. For learning a mixture of k axis-aligned Gaussians in d dimensions, we give an algorithm that outputs a mixture of O(k/ε^3) Gaussians that is ε-close in statistical distance to the true distribution, without any separation assumptions. The time and sample complexity is roughly O(kd/ε^3)^d. This is polynomial when d is constant – precisely the regime in which known methods fail to identify the components efficiently. Given that non-negativity is a natural assumption, we believe that our result may find use in other settings in which we wish to approximately “explain” data using a small number of a (large) candidate set of components.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/bhaskara15.html
https://proceedings.mlr.press/v38/bhaskara15.htmlGaussian Processes for Bayesian hypothesis tests on regression functionsGaussian processes have been used in different application domains such as classification, regression etc. In this paper we show that they can also be employed as a universal tool for developing a large variety of Bayesian statistical hypothesis tests for regression functions. In particular, we will use GPs for testing whether (i) two functions are equal; (ii) a function is monotone (even accounting for seasonality effects); (iii) a function is periodic; (iv) two functions are proportional. By simulation studies, we will show that, beside being more flexible, GP tests are also competitive in terms of performance with state-of-art algorithms.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/benavoli15.html
https://proceedings.mlr.press/v38/benavoli15.htmlConvex Multi-Task Learning by ClusteringWe consider the problem of multi-task learning in which tasks belong to hidden clusters. We formulate the learning problem as a novel convex optimization problem in which linear classifiers are combinations of (a small number of) some basis. Our formulation jointly learns both the basis and the linear combination. We propose a scalable optimization algorithm for finding the optimal solution. Our new methods outperform existing state-of-the-art methods on multi-task sentiment classification tasks.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/barzilai15.html
https://proceedings.mlr.press/v38/barzilai15.htmlNear-optimal max-affine estimators for convex regressionThis paper considers least squares estimators for regression problems over convex, uniformly bounded, uniformly Lipschitz function classes minimizing the empirical risk over max-affine functions (the maximum of finitely many affine functions). Based on new results on nonlinear nonparametric regression and on the approximation accuracy of max-affine functions, these estimators are proved to achieve the optimal rate of convergence up to logarithmic factors. Preliminary experiments indicate that a simple randomized approximation to the optimal estimator is competitive with state-of-the-art alternatives.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/balazs15.html
https://proceedings.mlr.press/v38/balazs15.htmlUnifying Local Consistency and MAX SAT Relaxations for Scalable Inference with Rounding GuaranteesWe prove the equivalence of first-order local consistency relaxations and the MAX SAT relaxation of Goemans and Williamson (1994) for a class of MRFs we refer to as logical MRFs. This allows us to combine the advantages of each into a single MAP inference technique: solving the local consistency relaxation with any of a number of highly scalable message-passing algorithms, and then obtaining a high-quality discrete solution via a guaranteed rounding procedure when the relaxation is not tight. Logical MRFs are a general class of models that can incorporate many common dependencies, such as logical implications and mixtures of supermodular and submodular potentials. They can be used for many structured prediction tasks, including natural language processing, computer vision, and computational social science. We show that our new inference technique can improve solution quality by as much as 20% without sacrificing speed on problems with over one million dependencies.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/bach15.html
https://proceedings.mlr.press/v38/bach15.htmlEfficient Sparse Clustering of High-Dimensional Non-spherical Gaussian MixturesWe consider the problem of clustering data points in high dimensions, i.e., when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with two non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Further, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we do not require spherical clusters nor necessitate mean separation along relevant dimensions.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/azizyan15.html
https://proceedings.mlr.press/v38/azizyan15.htmlGlobal Multi-armed Bandits with Hölder ContinuityStandard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to enable faster convergence to the optimal solution. In this paper, formalize a new class of multi-armed bandit methods, Global Multi-armed Bandit (GMAB), in which arms are globally informative through a global parameter, i.e., choosing an arm reveals information about all the arms. We propose a greedy policy for the GMAB which always selects the arm with the highest estimated expected reward, and prove that it achieves bounded parameter-dependent regret. Hence, this policy selects suboptimal arms only finitely many times, and after a finite number of initial time steps, the optimal arm is selected in all of the remaining time steps with probability one. In addition, we also study how the informativeness of the arms about each other’s rewards affects the speed of learning. Specifically, we prove that the parameter-free (worst-case) regret is sublinear in time, and decreases with the informativeness of the arms. We also prove a sublinear in time Bayesian risk bound for the GMAB which reduces to the well-known Bayesian risk bound for linearly parameterized bandits when the arms are fully informative. GMABs have applications ranging from drug dosage control to dynamic pricing.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/atan15.html
https://proceedings.mlr.press/v38/atan15.htmlSpectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nyström MethodThe CUR matrix decomposition and the related Nyström method build low-rank approximations of data matrices by selecting a small number of representative rows and columns of the data. Here, we introduce novel \emphspectral gap error bounds that judiciously exploit the potentially rapid spectrum decay in the input matrix, a most common occurrence in machine learning and data analysis. Our error bounds are much tighter than existing ones for matrices with rapid spectrum decay, and they justify the use of a constant amount of oversampling relative to the rank parameter k, i.e, when the number of columns/rows is \ell=k+ O(1). We demonstrate our analysis on a novel deterministic algorithm, \emphStableCUR, which additionally eliminates a previously unrecognized source of potential instability in CUR decompositions. While our algorithm accepts any method of row and column selection, we implement it with a recent column selection scheme with strong singular value bounds. Empirical results on various classes of real world data matrices demonstrate that our algorithm is as efficient as and often outperforms competing algorithms.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/anderson15.html
https://proceedings.mlr.press/v38/anderson15.htmlParameter Estimation of Generalized Linear Models without Assuming their Link FunctionCanonical generalized linear models (GLM) are completely specified by a finite dimensional vector and a monotonically increasing function called the link function. Standard parameter estimation techniques hold the link function fixed and optimizes over the parameter vector. We propose a parameter-recovery facilitating, jointly-convex, regularized loss functional that is optimized globally over the vector as well as the link function, with best rates possible under a first order oracle model. This widens the scope of GLMs to cases where the link function is unknown.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/acharyya15.html
https://proceedings.mlr.press/v38/acharyya15.htmlNonparametric Bayesian Factor Analysis for Dynamic Count MatricesA gamma process dynamic Poisson factor analysis model is proposed to factorize a dynamic count matrix, whose columns are sequentially observed count vectors. The model builds a novel Markov chain that sends the latent gamma random variables at time (t-1) as the shape parameters of those at time t, which are linked to observed or latent counts under the Poisson likelihood. The significant challenge of inferring the gamma shape parameters is fully addressed, using unique data augmentation and marginalization techniques for the negative binomial distribution. The same nonparametric Bayesian model also applies to the factorization of a dynamic binary matrix, via a Bernoulli-Poisson link that connects a binary observation to a latent count, with closed-form conditional posteriors for the latent counts and efficient computation for sparse observations. We apply the model to text and music analysis, with state-of-the-art results.Sat, 21 Feb 2015 00:00:00 +0000
https://proceedings.mlr.press/v38/acharya15.html
https://proceedings.mlr.press/v38/acharya15.html