Proceedings of Machine Learning Research

A Note on the Comparison of Polynomial Selection Methods

Sun, 03 Jan 1999 00:00:00 +0000

Minimum Message Length (MML) and Structural Risk Minimisation (SRM) are two computational learning principles that have achieved wide acclaim in recent years. Whereas the former is based on Bayesian learning and the latter on the classical theory of VC-dimension, they are similar in their attempt to define a trade-off between model complexity and goodness of fit to the data. A recent empirical study by Wallace compared the performance of standard model selection methods in a one-dimensional polynomial regression framework. The results from this study provided strong evidence in support of the MML and SRM based methods over the other standard approaches. In this paper we present a detailed empirical evaluation of three model selection methods which include an MML based approach and two SRM based methods. Results from our analysis and experimental evaluation suggest that the MML-based approach in general has higher predictive accuracy and also raise questions on the inductive capabilities of the Structural Risk Minimization Principle.

An experiment in causal discovery using a pneumonia database

Sun, 03 Jan 1999 00:00:00 +0000

We tested a causal discovery algorithm on a database of pneumonia patients. The output of the causal discovery algorithm is a list of statements "A causes B", where A and B are variables in the database, and a score indicating the degree of confidence in the statement. We compared the output of the algorithm with the opinions of physicians about whether A caused B or not. We found that the doctors opinions were independent of the output of the algorithm. However, an examination of the output of results suggested a simple, well motivated modification of the algorithm which would bring the output of the algorithm into high agreement with the physicians opinions.

Boosting Methodology for Regression Problems

Sun, 03 Jan 1999 00:00:00 +0000

Classification problems have dominated research on boosting to date. The application of boosting to regression problems, on the other hand, has received little investigation. In this paper we develop a new boosting method for regression problems. We cast the regression problem as a classification problem and apply an interpretable form of the boosted naïve Bayes classifier. This induces a regression model that we show to be expressible as an additive model for which we derive estimators and discuss computational issues. We compare the performance of our boosted naïve Bayes regression model with other interpretable multivariate regression procedures.

Tractable Structure Search in the Presence of Latent Variables

Sun, 03 Jan 1999 00:00:00 +0000

The problem of learning the structure of a DAG model in the presence of latent variables presents many formidable challenges. In particular there are an infinite number of latent variable models to consider, and these models possess features which make them hard to work with. We describe a class of graphical models which can represent the conditional independence structure induced by a latent variable model over the observed margin. We give a parametrization of the set of Gaussian distributions with conditional independence structure given by a MAG model. The models are illustrated via a simple example. Different estimation techniques are discussed in the context of Zellner’s Seemingly Unrelated Regression (SUR) models.

Efficient Mining of Statistical Dependencies

Sun, 03 Jan 1999 00:00:00 +0000

The Multi-Stream Dependency Detection algorithm finds rules that capture statistical dependencies between patterns in multivariate time series of categorical data. Rule strength is measured by the G statistic, and an upper bound on the value of G for the descendants of a node allows MSDD’s search space to be pruned. However, in the worst case, the algorithm will explore exponentially many rules. This paper presents and empirically evaluates two ways of addressing this problem. The first is a set of three methods for reducing the size of MSDD’s search space based on information collected during the search process. Second, we discuss an implementation of MSDD that distributes its computations over multiple machines on a network.

Bayesian Graphical Models, Intention-to-Treat, and the Rubin Causal Model

Sun, 03 Jan 1999 00:00:00 +0000

In clinical trials with significant noncompliance the standard intention-to-treat analyses sometimes mislead. Rubin’s causal model provides an alternative method of analysis that can shed extra light on clinical trial data. Formulating the Rubin Causal Model as a Bayesian graphical model facilitates model communication and computation.

Stochastic Local Search for Bayesian Network

Sun, 03 Jan 1999 00:00:00 +0000

The paper evaluates empirically the suitability of Stochastic Local Search algorithms (SLS) for finding most probable explanations in Bayesian networks. SLS algorithms (e.g., GSAT, WSAT [16]) have recently proven to be highly effective in solving complex constraint-satisfaction and satisfiability problems which cannot be solved by traditional search schemes. Our experiments investigate the applicability of this scheme to probabilistic optimization problems. Specifically, we show that algorithms combining hill-climbing steps with stochastic steps (guided by the network’s probability distribution) called G+StS, outperform pure hill-climbing search, pure stochastic simulation search, as well as simulated annealing. In addition, variants of G+StS that are aug- mented on top of alternative approximation methods are shown to be particularly effective.

Hierarchical Mixtures-of-Experts for Generalized Linear Models: Some Results on Denseness and Consistency

Sun, 03 Jan 1999 00:00:00 +0000

We investigate a class of hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(a+x^T b)$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(x))$ where $h \in W_{2;K_0}^\infty$ (a Sobolev class over $[0,1]^{s}$). It is shown that the HME mean functions can approximate the true mean function, at a rate of $O(m^{-2/s})$ in $L_p$ norm. Moreover, the HME probability density functions can approximate the true density, at a rate of $O(m^{-2/s})$ in Hellinger distance, and at a rate of $O(m^{-4/s})$ in Kullback-Leibler divergence. These rates can be achieved within the family of HME structures with a tree of binary splits, or within the family of structures with a single layer of experts. Here $s$ is the dimension of the predictor $x$. It is also shown that likelihood-based inference based on HME is consistent in recovering the truth, in the sense that as the sample size $n$ and the number of experts $m$ both increase, the mean square error of the estimated mean response goes to zero. Conditions for such results to hold are stated and discussed.

Probabilistic Kernel Regression Models

Sun, 03 Jan 1999 00:00:00 +0000

We introduce a class of flexible conditional probability models and techniques for classification/regression problems. Many existing methods such as generalized linear models and support vector machines are subsumed under this class. The flexibility of this class of techniques comes from the use of kernel functions as in support vector machines, and the generality from dual formulations of standard regression models.

Mean field inference in a general probabilistic setting

Sun, 03 Jan 1999 00:00:00 +0000

We present a systematic, model-independent formulation of mean field theory (MFT) as an inference method in probabilistic models. "Model-independent" means that we do not assume a particular type of dependency among the variables of a domain but instead work in a general probabilistic setting. In a Bayesian network, for example, you may use arbitrary tables to specify conditional dependencies and thus run MFT in any Bayesian network. Furthermore, the general mean field equations derived here shed a light on the essence of MFT. MFT can be interpreted as a local iteration scheme which relaxes in a consistent state (a solution of the mean field equations). Iterating the mean field equations means propagating information through the network. In general, however, there are multiple solutions to the mean field equations. We show that improved approximations can be obtained by forming a weighted mixture of the multiple mean field solutions. Simple approximate expressions for the mixture weights are given. The benefits of taking into account multiple solutions are demonstrated by using MFT for inference in a small Bayesian network representing a medical domain. Thereby it turns out that every solution of the mean field equations can be interpreted as a ’disease scenario’.

Model choice: A minimum posterior predictive loss approach

Sun, 03 Jan 1999 00:00:00 +0000

Model choice is a fundamental activity in the analysis of data sets, an activity which has become increasingly more important as computational advances enable the fitting of increasingly complex models. Such complexity typically arises through hierarchical structure which requires specification at each stage of probabilistic mechanisms, mean and dispersion forms, explanatory variables, etc. Nonnested hierarchical models introducing random effects may not be handled by classical methods. Bayesian approaches using predictive distributions can be used though the FORMAL solution, which includes Bayes factors as a special case, can be criticized. It seems natural to evaluate model performance by comparing what it predicts with what has been observed. Most classical criteria utilize such comparison. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, selecting the one which minimizes this criterion. For a version of log scoring loss we can do the minimization explicitly, obtaining an expression which can be interpreted as a penalized deviance criterion. We illustrate its performance with an application to a large data set involving residential property transactions.

On the geometry of DAG models with hidden variables

Sun, 03 Jan 1999 00:00:00 +0000

We prove that many graphical models with hidden variables are not curved exponential families. This result, together with the fact that some graphical models are curved and not linear, implies that the hierarchy of graphical models, as linear, curved, and stratified, is non-collapsing; each level in the hierarchy is strictly contained in the larger levels. This result is discussed in the context of model selection of graphical models.

Geometric modeling of a nuclear environment

Sun, 03 Jan 1999 00:00:00 +0000

This paper is about the task-directed updating of an incomplete and inaccurate geometric model of a nuclear environment, using only robust radiation-resistant sensors installed on a robot that is remotely controlled by a human operator. In this problem, there are many sources of uncertainty and ambiguity. This paper proposes a probabilistic solution under Gaussian assumptions. Uncertainty is reduced with an estimator based on a Kalman filter. Ambiguity on the measurement-feature association is resolved by running a bank of those estimators in parallel, one for each plausible association. The residual errors of these estimators are used for hypothesis testing and for the calculation of a probability distribution over the remaining hypotheses. The best next sensing action is calculated as a Bayes decision with respect to a loss function that takes into account both the uncertainty on the current estimate, and the variance/precision required by the task.

Efficient learning using constrained sufficient statistics

Sun, 03 Jan 1999 00:00:00 +0000

Learning Bayesian networks is a central problem for pattern recognition, density estimation and classification. In this paper, we propose a new method for speeding up the computational process of learning Bayesian network structure. This approach uses constraints imposed by the statistics already collected from the data to guide the learning algorithm. This allows us to reduce the number of statistics collected during learning and thus speed up the learning time. We show that our method is capable of learning structure from data more efficiently than traditional approaches. Our technique is of particular importance when the size of the datasets is large or when learning from incomplete data. The basic technique that we introduce is general and can be used to improve learning performance in many settings where sufficient statistics must be computed. In addition, our technique may be useful for alternate search strategies such as branch and bound algorithms.

Modeling decision tree performance with the power law

Sun, 03 Jan 1999 00:00:00 +0000

This paper discusses the use of a power law to predict decision tree performance. Power laws are fit to learning curves of decision trees trained on data sets from the UCI repository. The learning curves are generated by training C4.5 on different size training sets. The power law predicts diminishing returns in terms of error rate as training set size increase. By characterizing the learning curve with a power law, the error rate for a given size training set can be projected. This projection can be used in estimating the amount of data needed to achieve an acceptable error rate, and the cost effectiveness of further data collection.

Process-oriented evaluation: The next step

Sun, 03 Jan 1999 00:00:00 +0000

Methods to avoid overfitting fall into two broad categories: data-oriented (using separate data for validation) and representation-oriented (penalizing complexity in the model). Both have limitations that are hard to overcome. We argue that fully adequate model evaluation is only possible if the search process by which models are obtained is also taken into account. To this end, we recently proposed a method for process-oriented evaluation (POE), and successfully applied it to rule induction (Domingos, 1998). However, for the sake of simplicity this treatment made two rather artificial assumptions. In this paper the assumptions are removed, and a simple formula for model evaluation is obtained. Empirical trials show the new, better-founded form of POE to be as accurate as the previous one, while further reducing theory sizes.

Conditional products: An alternative approach to conditional independence

Sun, 03 Jan 1999 00:00:00 +0000

We introduce a new abstract approach to the study of conditional independence, founded on a concept analogous to the factorization properties of probabilistic independence, rather than the separation properties of a graph. The basic ingredient is the "conditional product", which provides a way of combining the basic objects under consideration while preserving as much independence as possible. We introduce an appropriate axiom system for conditional product, and show how, when these axioms are obeyed, they induce a derived concept of conditional independence which obeys the usual semi-graphoid axioms. The general structure is used to throw light on three specific areas: the familiar probabilistic framework (both the discrete and the general case); a set-theoretic framework related to "variation independence"; and a variety of graphical frameworks.

Causal Mechanisms and Classification Trees for Predicting Chemical Carcinogens

Sun, 03 Jan 1999 00:00:00 +0000

Classification trees, usually used as a nonlinear, nonparametric classification method, can also provide a powerful framework for comparing, assessing, and combining information from different expert systems, by treating their predictions as the independent variables in a classification tree analysis. This paper discusses the applied problem of classifying chemicals as human carcinogens. It shows how classification trees can be used to compare the information provided by ten different carcinogen classification expert systems, construct an improved "hybrid" classification system from them, and identify cost-effective combinations of assays (the inputs to the expert systems) to use in classifying chemicals in future.

Pattern Discovery via Entropy Minimization

Sun, 03 Jan 1999 00:00:00 +0000

We propose a framework for learning hidden-variable models by optimizing entropies, in which entropy minimization, posterior maximization, and free energy minimization are all equivalent. Solutions for the maximum a posteriori (MAP) estimator yield powerful learning algorithms that combine all the charms of expectation-maximization and deterministic annealing. Contained as special cases are the methods of maximum entropy, maximum likelihood, and a new method, maximum structure. We focus on the maximum structure case, in which entropy minimization maximizes the amount of evidence supporting each parameter while minimizing uncertainty in the sufficient statistics and cross-entropy between the model and the data. In iterative estimation, the MAP estimator gradually extinguishes excess parameters, sculpting a model structure that reflects hidden structures in the data. These models are highly resistant to over-fitting and have the particular virtue of being easy to interpret, often yielding insights into the hidden causes that generate the data.

Hierarchical IFA Belief Networks

Sun, 03 Jan 1999 00:00:00 +0000

We introduce a new real-valued belief network, which is a multilayer generalization of independent factor analysis (IFA). At each level, this network extracts real-valued latent variables that are non-linear functions of the input data with a highly adaptive functional form, resulting in a hierarchical distributed representation of these data. The network is based on a probabilistic generative model, constructed by cascading single-layer IFA models. Whereas exact maximum-likelihood learning for this model is intractable, we present and demonstrate an algorithm that maximizes a lower bound on the likelihood. This algorithm is developed by formulating a variational approach to hierarchical IFA networks.