Proceedings of Machine Learning Research

Multiple DAGs Learning with Non-negative Matrix Factorization

Sun, 03 Sep 2017 00:00:00 +0000

Probabilistic graphical models, e.g., Markov networks and Bayesian networks, have been well studied in the past two decades. However, it is still difficult to learn a reliable network structure, especially with limited data. Recent work has found that multi-task learning can improve the robustness of the learned networks by leveraging data from related tasks. In this paper, we focus on estimating the Directed Acyclic Graph (DAG) of a Bayesian network. Most existing multi-task or transfer learning algorithms for Bayesian networks use the DAG relatedness as an inductive bias in the optimization of multiple structures. More specifically, some works first find shared hidden structures among related tasks, and then treat them as structure penalties in the learning step. However, current works omit the setting in which the shared hidden structure comes from different parts of different DAGs. Thus, in this paper, Non-negative Matrix Factorization (NMF) is employed to learn a parts-based representation that addresses this problem. Theoretically, we show the plausibility of our approach. Empirically, we show that, compared to single-task learning, multi-task learning is better able to positively identify true edges with synthetic data and real-world landmine data.
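
As a rough illustration of the parts-based idea, the sketch below stacks the flattened adjacency matrices of several toy DAGs as columns of a non-negative matrix V and factors it as V ≈ WH with the standard multiplicative updates for squared error. It is not the paper's algorithm: the toy edge lists, the rank, and all function names are assumptions made for this example only.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (m x n) as w (m x rank) times h (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return w, h


def flatten_adjacency(edges, n_vars):
    """Row-major 0/1 adjacency vector for a list of (parent, child) edges."""
    cells = [0.0] * (n_vars * n_vars)
    for parent, child in edges:
        cells[parent * n_vars + child] = 1.0
    return cells


# Hypothetical toy data: four related tasks over three variables.
tasks = [
    [(0, 1), (1, 2)],
    [(0, 1), (0, 2)],
    [(1, 2), (0, 2)],
    [(0, 1), (1, 2), (0, 2)],
]
V = transpose([flatten_adjacency(edges, 3) for edges in tasks])  # 9 x 4
W, H = nmf(V, rank=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

Each column of W can be read as a non-negative "part" (a weighted set of edges), and each column of H says how strongly a task uses each part. How such parts would enter a structure penalty is left to the paper.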

Causal Learning and Machine Learning

Sun, 03 Sep 2017 00:00:00 +0000

Can we find the causal direction between two variables? How can we make optimal predictions in the presence of distribution shift? We are often faced with such causal modeling or prediction problems. Recently, with the rapid accumulation of huge volumes of data, both causal discovery, i.e., learning causal information from purely observational data, and machine learning are seeing exciting opportunities as well as great challenges. This talk will be focused on recent advances in causal discovery and how causal information facilitates understanding and solving certain problems of learning from heterogeneous data. In particular, I will talk about basic approaches to causal discovery and address practical issues in causal discovery, including nonstationarity or heterogeneity of the data and existence of measurement error. Finally, I will discuss why and how underlying causal knowledge helps in learning from heterogeneous data when the i.i.d. assumption is dropped, with transfer learning as a particular example.

Hidden Node Detection between Two Observable Nodes Based on Bayesian Clustering

Sun, 03 Sep 2017 00:00:00 +0000

Structure learning is one of the main concerns in studies of Bayesian networks. In the present paper, we consider a network consisting of both observable and hidden nodes, and propose a method to investigate the existence of a hidden node between two observable nodes, which is a model selection problem between the networks with and without the middle hidden node. When the network includes a hidden node, it is known that there are singularities in the parameter space, and the Fisher information matrix is not positive definite. Consequently, many conventional criteria for structure learning based on the Laplace approximation do not work. The proposed method is based on Bayesian clustering, and its asymptotic property justifies the result; the redundant labels are eliminated and the simplest structure is detected even if there are singularities.

Fast Compilation of s-t Paths on a Graph for Counting and Enumeration

Sun, 03 Sep 2017 00:00:00 +0000

In this paper, we propose a new method to compile $s$-$t$ simple paths on a graph using a new compilation technique called merging frontier based search. Recently, Nishino et al. proposed a top-down construction algorithm, which compiles $s$-$t$ simple paths into a Zero-suppressed SDD (ZSDD), and they showed that this method is more efficient than simpath by Knuth. However, since the method of Nishino et al. uses ZSDD as a tractable representation, it requires complicated steps for compilation. In this paper, we propose z-st-d-DNNF, which is a superset of ZSDD. By using this representation instead of ZSDD, we show that more efficient $s$-$t$ simple path compilation can be realized.
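
To make the counting task concrete, the following sketch counts $s$-$t$ simple paths by plain depth-first backtracking. This is only a naive baseline for small graphs, not the frontier-based compilation described above; the grid-graph helper and all names are assumptions chosen for the example.

```python
def count_simple_paths(adjacency, source, target):
    """Count simple paths (no repeated vertex) from source to target."""
    visited = {source}

    def extend(node):
        if node == target:
            return 1
        total = 0
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                total += extend(neighbour)
                visited.remove(neighbour)  # backtrack
        return total

    return extend(source)


def grid_graph(side):
    """Undirected grid with side x side vertices, keyed by (row, col)."""
    adjacency = {}
    for r in range(side):
        for c in range(side):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            adjacency[(r, c)] = [(a, b) for a, b in candidates
                                 if 0 <= a < side and 0 <= b < side]
    return adjacency


grid = grid_graph(3)
print(count_simple_paths(grid, (0, 0), (2, 2)))
```

The running time of this enumeration grows with the number of paths, which is the cost that compiled representations aim to avoid by sharing common sub-structures.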

Restricted Quasi Bayesian Networks as a Prototyping Tool for Computational Models of Individual Cortical Areas

Sun, 03 Sep 2017 00:00:00 +0000

We propose \textit{restricted quasi Bayesian networks} as an efficient prototyping tool for designing computational models of individual cortical areas of the brain. Restricted quasi Bayesian networks are simplified Bayesian networks that only distinguish probability value 0 from other values. Using our tool, it is possible to concentrate on the essential part of model design and efficiently construct prototypes. We demonstrate that restricted quasi Bayesian networks actually work well as a prototyping tool by implementing a syntactic parser for an ambiguous English sentence.

Advanced Methodologies for Bayesian Networks 2017: Preface

Sun, 03 Sep 2017 00:00:00 +0000

Preface

Hyperparameter sensitivity revisited

Sun, 03 Sep 2017 00:00:00 +0000

The BDeu scoring criterion for learning Bayesian network structures is known to be very sensitive to the equivalent sample size hyper-parameter. Recently some authors have suggested alternative Bayesian scoring criteria that appear to behave better than BDeu. So is the problem solved? We will review the problem and suggested solutions and present an empirical assessment of the current situation.

Fast Message Passing Algorithm Using ZDD-Based Local Structure Compilation

Sun, 03 Sep 2017 00:00:00 +0000

Compiling Bayesian Networks (BNs) into secondary structures to implement efficient exact inference is a hot topic in probabilistic modeling. One class of algorithms for compiling BNs transforms them into junction tree structures utilizing the conditional dependency in the network. By performing message passing on the junction tree structure, we can efficiently calculate marginal probabilities for any variable in the network. However, the message passing algorithm does not consider the local structure in the network. Since the ability to exploit local structure to avoid redundant calculations has a significant impact on exact inference, in this article, we propose a fast message passing algorithm that exploits local structure using Zero-suppressed Binary Decision Diagrams (ZDDs). We convert all the components used in the message passing algorithm into Multi-linear Functions (MLFs), and then compile them into a compact representation using ZDDs. We show that message passing on ZDDs can work more efficiently than the conventional message passing algorithm on junction tree structures on some benchmark networks, although it may be too memory-consuming for some larger instances.

Dirichlet Bayesian Network Scores and the Maximum Entropy Principle

Sun, 03 Sep 2017 00:00:00 +0000

A classic approach for learning Bayesian networks from data is to select the \emph{maximum a posteriori} (MAP) network. In the case of discrete Bayesian networks, the MAP network is selected by maximising one of several possible Bayesian Dirichlet (BD) scores; the most famous is the \emph{Bayesian Dirichlet equivalent uniform} (BDeu) score from Heckerman \emph{et al.} (1995). The key properties of BDeu arise from its underlying uniform prior, which makes structure learning computationally efficient; does not require the elicitation of prior knowledge from experts; and satisfies score equivalence. In this paper we will discuss the impact of this uniform prior on structure learning from an information theoretic perspective, showing how BDeu may violate the maximum entropy principle when applied to sparse data and how it may also be problematic from a Bayesian model selection perspective. On the other hand, the BDs score proposed in Scutari (2016) arises from a piecewise prior and it does not appear to violate the maximum entropy principle, even though it is asymptotically equivalent to BDeu.
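
For reference, the local BDeu term for one node can be written down directly from counts. The sketch below assumes a table `counts[j][k]` holding the number of samples with parent configuration j and child state k, and uses the usual uniform hyperparameters ess/(q r); the function name and the toy counts are assumptions for illustration.

```python
from math import lgamma


def bdeu_local_score(counts, ess=1.0):
    """Log BDeu marginal likelihood of one node given its parents.

    counts[j][k] is the number of samples with parent configuration j
    and child state k; ess is the equivalent (imaginary) sample size.
    """
    q = len(counts)        # number of parent configurations
    r = len(counts[0])     # number of child states
    alpha_j = ess / q
    alpha_jk = ess / (q * r)
    score = 0.0
    for row in counts:
        score += lgamma(alpha_j) - lgamma(alpha_j + sum(row))
        for n_jk in row:
            score += lgamma(alpha_jk + n_jk) - lgamma(alpha_jk)
    return score


# Hypothetical sparse counts: two parent configurations, binary child.
toy_counts = [[3, 0], [0, 2]]
for ess in (0.1, 1.0, 10.0):
    print(ess, bdeu_local_score(toy_counts, ess))
```

Running the loop with different `ess` values gives a quick feel for how strongly the score of a sparse table depends on the imaginary sample size.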

Improved Local Search in Bayesian Networks Structure Learning

Sun, 03 Sep 2017 00:00:00 +0000

We present a novel approach for score-based structure learning of Bayesian networks, which couples an existing ordering-based algorithm for structure optimization with a novel operator for exploring the neighborhood of a given order in the space of the orderings. Our approach achieves state-of-the-art performance on data sets containing thousands of variables.

Learning probability by comparison

Sun, 03 Sep 2017 00:00:00 +0000

Learning probability by probabilistic modeling is a major task in statistical machine learning and it has traditionally been supported by maximum likelihood estimation applied to generative models or by a local maximizer applied to discriminative models. In this talk, we introduce a third approach, an innovative one that learns probability by comparing probabilistic events. In our approach, we give the ranking of probabilistic events and the system learns a probability distribution so that the ranking is well respected. We implemented this approach in PRISM, a logic-based probabilistic programming language, and conducted learning experiments with real data for models described by PRISM programs.

Learning Causal AMP Chain Graphs

Sun, 03 Sep 2017 00:00:00 +0000

Andersson-Madigan-Perlman chain graphs were originally introduced to represent independence models. They have recently been shown to be suitable for representing causal models with additive noise. In this paper, we present an algorithm for learning causal chain graphs. The algorithm builds on the ideas of Hoyer et al. (2009), i.e., it exploits the nonlinearities in the data to identify the direction of the causal relationships. We also report experimental results on real-world data.

Causal Effect Identification in Alternative Acyclic Directed Mixed Graphs

Sun, 03 Sep 2017 00:00:00 +0000

Alternative acyclic directed mixed graphs (ADMGs) are graphs that may allow causal effect identification in scenarios where Pearl's original ADMGs may not, and vice versa. Therefore, they complement each other. In this paper, we introduce a sound algorithm for identifying arbitrary causal effects from alternative ADMGs. Moreover, we show that the algorithm is complete for identifying the causal effect of a single random variable on the rest. We also show that the algorithm follows from a calculus similar to Pearl's do-calculus.

Consistent Learning Bayesian Networks with Thousands of Variables

Sun, 03 Sep 2017 00:00:00 +0000

We have previously proposed a constraint-based method for learning Bayesian networks using Bayes factors. Since a conditional independence test based on Bayes factors is consistent, the method improves on the learning accuracy of traditional constraint-based learning methods. Additionally, the method is expected to learn larger network structures than the traditional methods do because it greatly improves computational efficiency. However, its expected benefits have not been demonstrated empirically. This report describes some experiments related to the learning of large network structures. Results show that the proposed method can learn very large networks with thousands of variables.

An Experimental Analysis of Anytime Algorithms for Bayesian Network Structure Learning

Sun, 03 Sep 2017 00:00:00 +0000

Bayesian networks are a widely used graphical model with diverse applications in knowledge discovery, classification, and decision making. Learning a Bayesian network from discrete data can be cast as a combinatorial optimization problem and thus solved using optimization techniques---the well-known \emph{score-and-search} approach. An important consideration when applying a score-and-search method for Bayesian network structure learning (BNSL) is its anytime behavior; i.e., how does the quality of the solution found improve as a function of the amount of time given to the algorithm. Previous studies of the anytime behavior of methods for BNSL are limited by the scale of the instances used in the evaluation and evaluate only algorithms that do not scale to larger instances. In this paper, we perform an extensive evaluation of the anytime behavior of the current state-of-the-art algorithms for BNSL. Our benchmark instances range from small (instances with fewer than 20 random variables) to massive (instances with more than 1,500 random variables). We find that a local search algorithm based on memetic search dominates the performance of other state-of-the-art algorithms when considering anytime behavior.

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

Sun, 03 Sep 2017 00:00:00 +0000

In this paper, we analyze the size of decision diagrams (DD) representing the set of all parse trees of a context-free grammar (CFG). CFG is widely used in the field of natural language processing and bioinformatics to estimate the hidden structures of sequence data. A decision diagram is a data structure that represents a Boolean function in a concise form. By using DDs to represent the set of all parse trees, we can efficiently perform many useful operations over the parse trees, such as finding trees that satisfy additional constraints and finding the best parse tree. Since the time complexity of these operations depends on DD size, selecting an appropriate DD variant is important. Experiments on a simple CFG show that the Zero-suppressed Sentential Decision Diagram (ZSDD) is better than other DDs; we also give theoretical upper bounds on ZSDD size.
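
To give a sense of the object being represented, the sketch below counts the parse trees of a sentence under a grammar in Chomsky normal form with a CYK-style chart. It counts trees rather than building a decision diagram, and the dictionary encoding of the grammar and the toy grammar S -> S S | a are assumptions made for this example.

```python
from collections import defaultdict


def count_parse_trees(words, lexical, binary, start="S"):
    """Count parse trees of `words` for a grammar in Chomsky normal form.

    lexical maps a terminal to the non-terminals that produce it;
    binary maps a pair (B, C) to the non-terminals A with a rule A -> B C.
    """
    n = len(words)
    chart = defaultdict(int)  # (i, j, symbol) -> number of trees over words[i:j]
    for i, word in enumerate(words):
        for symbol in lexical.get(word, ()):
            chart[(i, i + 1, symbol)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for split in range(i + 1, j):
                for (left, right), parents in binary.items():
                    ways = (chart.get((i, split, left), 0)
                            * chart.get((split, j, right), 0))
                    if ways:
                        for parent in parents:
                            chart[(i, j, parent)] += ways
    return chart.get((0, n, start), 0)


# Hypothetical ambiguous toy grammar: S -> S S | 'a'
lexical_rules = {"a": ["S"]}
binary_rules = {("S", "S"): ["S"]}
print(count_parse_trees(list("aaaa"), lexical_rules, binary_rules))
```

For this toy grammar the counts follow the Catalan numbers, so the number of trees grows quickly with sentence length, which is why a compact shared representation of the whole set matters.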

Few-to-few Cross-domain Object Matching

Sun, 03 Sep 2017 00:00:00 +0000

Cross-domain object matching refers to the task of inferring unknown alignment between objects in two data collections that do not have a shared data representation. In recent years several methods have been proposed for solving the special case that assumes each object is to be paired with exactly one object, resulting in a constrained optimization problem over permutations. A related problem formulation of cluster matching seeks to match a cluster of objects in one data set to a cluster of objects in the other set, which can be considered as a many-to-many extension of cross-domain object matching and can be solved without explicit constraints. In this work we study the intermediate region between these two special cases, presenting a range of Bayesian inference algorithms that also work for few-to-few cross-domain object matching problems where constrained optimization is necessary but the optimization domain is broader than just permutations.

Analyzing Tandem Mass Spectra: A Graphical Models Perspective

Sun, 03 Sep 2017 00:00:00 +0000

In the past two decades, the field of proteomics has seen explosive growth, largely due to the development of tandem mass spectrometry (MS/MS). With a complex biological sample as input, a typical MS/MS experiment quickly produces a large (often numbering in the hundreds of thousands) collection of spectra representative of the proteins present in the original complex sample. A majority of widely used methods to search and identify MS/MS spectra use scoring functions which rely on static, hand-selected parameters rather than affording the ability to learn parameters and adapt to the widely varying characteristics of MS/MS data. In this talk, we discuss recent work utilizing dynamic Bayesian networks (DBNs) to identify MS/MS spectra. In particular, we discuss a recently proposed DBN for Rapid Identification of Peptides (DRIP) which, in contrast to popular scoring functions, allows efficient generative and discriminative learning of parameters to achieve state-of-the-art spectrum-identification accuracy. Furthermore, facilitated by DRIP’s generative nature, we present current innovations leveraging DBNs to significantly enhance many other aspects of MS/MS analysis, such as improving downstream discriminative classification via detailed feature extraction and speeding up identification runtime using trellises and approximate inference.

Learning Bayesian Network Parameters with Domain Knowledge and Insufficient Data

Sun, 03 Sep 2017 00:00:00 +0000

To improve the learning accuracy of parameters in a Bayesian network (BN) from limited data, domain knowledge is often incorporated into the learning process as parameter constraints. Maximum a posteriori (MAP) based methods that use both data and constraints have been studied extensively. Among those methods, the qualitatively maximum a posteriori (QMAP) method exhibits high learning performance. In the QMAP method, when the data are limited, estimation from the data often fails to satisfy all the parameter constraints, which makes the overall QMAP estimation unreliable. To ensure that the QMAP estimation does not violate any given parameter constraint and further improve the learning accuracy, in this paper, we propose a qualitatively maximum a posteriori correction (QMAP-C) estimation algorithm, which regulates QMAP estimation by replacing the data estimation with a further constrained estimation. Experiments show that the proposed algorithm outperforms most of the existing parameter learning methods when the parameter constraints are correct.

Reducing the Cost of Probabilistic Knowledge Compilation

Sun, 03 Sep 2017 00:00:00 +0000

Bayesian networks (BNs) are a popular representation for reasoning under uncertainty. The computational complexity of inference, however, hinders their applicability to many real-world domains that in principle can be modeled by BNs. Inference methods based on Weighted Model Counting (WMC) reduce the cost of inference by exploiting patterns exhibited by the probabilities associated with BN nodes. However, these methods require a computationally intensive compilation step in search of these patterns, limiting the number of BNs that are eligible based on their size. In this paper, we aim to extend WMC methods in general by proposing a scalable compilation framework that is language agnostic, which solves this problem by partitioning BNs and compiling them as a set of smaller sub-problems. This reduces the cost of compilation and allows state-of-the-art innovations in WMC to be applied to a much larger range of Bayesian networks.

Incorporating Uncertain Evidence Into Arithmetic Circuits Representing Probability Distributions

Sun, 03 Sep 2017 00:00:00 +0000

Arithmetic circuits have been used as tractable representations of probability distributions, either generated from models such as Bayesian networks, sum-product networks and Probabilistic Sentential Decision Diagrams, or directly from data. An interesting question is how we can incorporate uncertain evidence, which specifies that the marginal probabilities of a variable have to undergo certain changes, directly into an arithmetic circuit and then perform reasoning on it to compute the probability distribution after incorporating this uncertain evidence. In this paper, we show that we can incorporate uncertain evidence on a variable by setting indicators of this variable in the arithmetic circuit to non-negative values based on the likelihood ratios in Pearl's method of virtual evidence and the current marginal probabilities of this variable. For tractable computation of these marginal probabilities, the arithmetic circuit has to satisfy the properties of decomposability and smoothness, and we show that an algorithm using a downward pass can compute these marginal probabilities for all single variables. We show a procedure for incorporating virtual evidence, including multiple pieces of virtual evidence.
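
The effect of Pearl-style virtual evidence can be illustrated on a small explicit joint table: each joint entry is weighted by the likelihood attached to the state of the evidence variable and the result is renormalized. The sketch below works on a dictionary of joint probabilities rather than on an arithmetic circuit, and the function names, the joint table, and the likelihood values are assumptions made for the example.

```python
def marginal(joint, var_index, n_states):
    """Marginal distribution of one variable from an explicit joint table."""
    totals = [0.0] * n_states
    for assignment, probability in joint.items():
        totals[assignment[var_index]] += probability
    return totals


def apply_virtual_evidence(joint, var_index, likelihoods):
    """Weight each joint entry by the likelihood of its state, then renormalize."""
    weighted = {assignment: probability * likelihoods[assignment[var_index]]
                for assignment, probability in joint.items()}
    normalizer = sum(weighted.values())
    return {assignment: value / normalizer
            for assignment, value in weighted.items()}


# Hypothetical joint over two binary variables (X, Y).
joint_xy = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.6, (1, 1): 0.2}
# Virtual evidence on X with likelihood ratio 3:1 in favour of X = 0.
updated = apply_virtual_evidence(joint_xy, var_index=0, likelihoods=[3.0, 1.0])
print(marginal(updated, 0, 2), marginal(updated, 1, 2))
```

Only the ratio of the likelihood values matters, since any common scale factor cancels in the normalization. In a circuit, the analogous step would set the indicators of the variable instead of touching joint entries.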

Backoff methods for estimating parameters of a Bayesian network

Sun, 03 Sep 2017 00:00:00 +0000

Various authors have highlighted inadequacies of BDeu-type scores, and this problem is shared in parameter estimation. Basically, Laplace estimates work poorly, at least because setting the prior concentration is challenging. In 1997, Friedman et al. suggested a simple backoff approach for Bayesian network classifiers (BNCs). Backoff methods dominate in n-gram language models, with modified Kneser-Ney smoothing being the best known, and a Bayesian variant exists in the form of Pitman-Yor process language models from Teh in 2006. In this talk we will present some results on using backoff methods for Bayesian network classifiers and Bayesian networks generally. For BNCs at least, the improvements are dramatic and alleviate some of the issues of choosing too dense a network.