Proceedings of Machine Learning Research

Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics
Held in Fort Lauderdale, FL, USA on 04-07 January 1997
Published as Reissue 1 by the Proceedings of Machine Learning Research on 30 March 2021.
Volume Edited by: David Madigan, Padhraic Smyth
Series Editors: Neil D. Lawrence
https://proceedings.mlr.press/r1/

WWW Cache Layout to Ease Network Overload
The GBI (graph-based induction) concept learning method is applied to extract typical access patterns of WWW data. By interpreting extracted patterns as the cache site layout, we can reduce the total network data flow by implementing a distributed cache system which is adapted to the WWW access patterns. Although the huge WWW data flow overflows the conventional hierarchical cache system, the layout created by the GBI method eases this problem. The traffic reduction ratio of this distributed cache system is 2.5 times higher than that of the conventional hierarchical cache system. Our results suggest the importance of data analysis methods which can handle structured data. By analyzing regularity in graph structures, the GBI method can reduce the network data flow; the statistical criteria contribute to the analysis of promising patterns.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/yoshida97a.html

MML Mixture Modelling of Multi-state, Poisson, von Mises circular and Gaussian Distributions
Minimum Message Length (MML) is an invariant Bayesian point estimation technique which is also consistent and efficient. We provide a brief overview of MML inductive inference (Wallace and Boulton (1968), Wallace and Freeman (1987)), and how it has both an information-theoretic and a Bayesian interpretation.
We then outline how MML is used for statistical parameter estimation, and how the MML mixture modelling program, Snob (Wallace and Boulton (1968), Wallace (1986), Wallace and Dowe (1994)), uses the message lengths from various parameter estimates to enable it to combine parameter estimation with selection of the number of components. The message length is (to within a constant) the negative logarithm of the posterior probability of the theory. So, the MML theory can also be regarded as the theory with the highest posterior probability. Snob currently assumes that variables are uncorrelated, and permits multivariate data from Gaussian, discrete multi-state, Poisson and von Mises circular distributions.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/wallace97a.html

Statistical Aspects of Classification in Drifting Populations
This paper discusses ideas for adaptive learning which can capture dynamic aspects of real-world datasets. Broadly, we explore two approaches. The first examines ways of updating the classification rule as suggested by some monitoring process (similar to those used in a quality control problem), and this is applied to linear, logistic and quadratic discriminant. The second approach examines nonparametric classifiers based explicitly on the data and ways in which the data can be dynamically adapted to improve the performance. These methods are tried out on simulated data and real data from the credit industry.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/taylor97a.html

On the Error Probability of Model Selection for Classification
We consider model selection based on information criteria for classification.
The information criterion is expressed in the form of the empirical entropy plus a compensation term $(k(g)/2)d(n)$, where $k(g)$ is the number of independent parameters in a model $g$, $d(n)$ is a function of $n$, and $n$ is the number of examples. First of all, we derive for arbitrary $d(\cdot)$ the asymptotically exact error probabilities in model selection. Although it was known for linear/autoregression processes that $d(n) = \log \log n$ is the minimum function of $n$ such that the model selection satisfies strong consistency, the problem of whether the same holds for classification has been open. We solve this problem affirmatively. Additionally, we derive for arbitrary $d(\cdot)$ the expected Kullback-Leibler divergence between a true conditional probability and the conditional probability estimated by the model selection and the Laplace estimators. The derived value is $k(g^*)/(2n)$, where $g^*$ is a true model, and the accumulated value over $n$ time instances is $(k(g^*)/2) \log n + O(1)$, which implies the optimality of a predictive coding based on the model selection. Keywords: model selection, error probability, strong consistency, Kullback-Leibler divergence, minimum description length principle, Hannan/Quinn’s procedure, unseparated/separated models, Kolmogorov’s law of the iterated logarithm.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/suzuki97a.html

A Polynomial Time Algorithm for Determining DAG Equivalence in the Presence of Latent Variables and Selection Bias
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/spirtes97b.html

Heuristic Greedy Search Algorithms for Latent Variable Models
There has recently been significant progress in the development of algorithms for learning the directed acyclic graph (DAG) part of a Bayesian network without latent variables from data and optional background knowledge.
However, the problem of learning the DAG part of a Bayesian network with latent (unmeasured) variables is much more difficult for two reasons: first, the number of possible models is infinite, and second, calculating scores for latent variable models is generally much slower than calculating scores for models without latent variables. In this paper we will describe how to extend search algorithms developed for non-latent variable DAG models to the case of DAG models with latent variables. We will introduce two generalizations of DAGs, called mixed ancestor graphs (or MAGs) and partial ancestor graphs (or PAGs), and briefly describe how they can be used to search for latent variable DAG models, to classify, and to predict the effects of interventions in causal systems.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/spirtes97a.html

Cross-validated Likelihood for Model Selection in Unsupervised Learning
Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cross-validated classification error (Breiman et al., 1984) and variable selection in linear regression using cross-validated predictive squared error (Hjort, 1995). Cross-validation is less commonly used in \emph{unsupervised} learning such as clustering. It is popular in kernel density estimation for choosing the smoothing parameter (the kernel bandwidth). However, it does not appear to have been used for the problem of determining cluster structure in clustering problems, i.e., solving the problem of how many clusters to fit to a given data set. This is the problem addressed in this paper. Cross-validated likelihood can be viewed as an appropriate metric for model selection in probabilistic clustering, in particular for finite mixture models.
In this paper, the use of cross-validated likelihood for clustering is investigated and the method is applied to a real problem where "truth" is unknown, i.e., determining the number of intrinsic "regimes" in records of upper atmosphere pressure taken daily since 1948 over the Northern Hemisphere.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/smyth97a.html

An Incremental Construction of a Nonparametric Regression Model
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/smid97a.html

A Distance Metric for Classification Trees
CART is an unstable classifier resulting in significant changes in tree structure for small changes in the learning set (Breiman, Friedman et al. 1984; Breiman 1994). To address this problem, research into combining classifiers has increased significantly in the last few years (Breiman 1992; Wolpert 1992; Breiman 1994). These methods are of two basic types: concatenation uses the output from one classifier as input to the next classifier; parallel classifiers work on the same input data with the output from each classifier combined using regression or vote-counting techniques (Schurman 1996). These strategies greatly improve the predictive power of unstable classifiers. However, when the goal of the statistical analysis is to learn about the relationship between outcome and predictors, these strategies for combining classifiers are unacceptable since they produce a large number of trees, making interpretation difficult. We present a new method for combining classification trees which results in a single, interpretable tree. We begin by defining a distance metric between two trees based on the amount of rearrangement needed so that the structure of the two trees is identical. Using this distance metric, we develop the concept of the central, or median, tree structure and estimate it using a consensus rule.
This tree is seen to be more centrally located than the tree fit to all the data. We finish by discussing future work including alternative methods for estimating the median tree, probability models, uses in data mining and meta-analysis, and performance measurements of the median tree.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/shannon97a.html

Estimating Latent Causal Inferences: Tetrad II model selection and Bayesian parameter estimation
The statistical evidence for the detrimental effect of low level lead exposure on the cognitive capacities of children has been debated for several decades. In this paper I describe how two techniques from artificial intelligence and statistics proved crucial in making the statistical evidence for the accepted epidemiological conclusion seem decisive. The first is a variable-selection routine in TETRAD II, and the second a Bayesian estimation of the parameter reflecting the causal influence of Actual Lead Exposure, a latent variable, on the measured IQ score of middle class suburban children.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/scheines97a.html

Mixed Memory Markov Models
We consider how to parameterize Markov models with prohibitively large state spaces. This is done by representing the transition matrix as a convex combination, or mixture, of simpler dynamical models. The parameters in these models admit a simple probabilistic interpretation and can be fitted iteratively by an Expectation-Maximization (EM) procedure. We give examples where these models may be a faithful and/or useful representation of the underlying dynamics. We also derive a set of generalized Baum-Welch updates for hidden Markov models (HMMs) that make use of this parameterization.
Because these models decompose the hidden state as the Cartesian product of two or more random variables, they are well suited to the modeling of coupled time series.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/saul97a.html

Applying a Gaussian-Bernoulli Mixture Model Network to Binary and Continuous Missing Data in Medicine
We wish to train a feedforward projective-sigmoidal neural network (MLP) on breast cancer outcomes data missing both binary and continuous input variable values. A Gaussian-Bernoulli mixture model is trained on the data (using EM). It then performs stochastic imputation (filling in) of the missing values, as a preprocessor to the MLP. In order to compare predictive accuracy when the training data are complete vs. incomplete/imputed, we use only complete cases from a natural data set, but artificially remove 80% of their input data values. Very little difference is observed in the comparison, suggesting that the mixture model is quite effective here, despite the fact that more than 99% of the cases/instances had some missing value(s). The mixture model can be used both for output/outcome prediction by a trained MLP and for the training process itself.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/rosen97a.html

A Note on Cyclic Graphs and Dynamical Feedback Systems
Directed acyclic graphical (DAG) models were motivated in large part by the desire to have a general formalism to represent causal hypotheses and the restrictions on probability distributions they imply. DAG models exploited a fundamental kinship in a variety of statistical formalisms often treated as distinct "models": factor models, structural equation models, regression models, logistic regression models, survival models, etc.
The fundamental connection is through either of two equivalent (for DAGs) properties, a "local" Markov condition, or the property of d-separation, sometimes called the "global" Markov condition (Pearl 1988, Lauritzen et al. 1990). In much the same spirit, directed cyclic graphs (DCGs) have been introduced to represent the causal structure of feedback processes and the restrictions on probability distributions those structures imply. Developments in our understanding of DCGs have proceeded so rapidly that it is appropriate to consider the prospects and limitations of cyclic representations of feedback systems. (For an alternative approach to extending graphical models to the temporal domain see Aliferis and Cooper, 1996.)
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/richardson97b.html

Extensions of Undirected and Acyclic, Directed Graphical Models
The use of acyclic, directed graphs (often called 'DAGs') to simultaneously represent causal hypotheses and to encode independence and conditional independence constraints associated with those hypotheses has proved fruitful in the construction of expert systems, in the development of efficient updating algorithms (Pearl, 1988, Lauritzen et al. 1988), and in inferring causal structure (Pearl and Verma, 1991; Cooper and Herskovits 1992; Spirtes, Glymour and Scheines, 1993). In section 1 I will survey a number of extensions of the DAG framework based on directed graphs and chain graphs (Lauritzen and Wermuth 1989; Frydenberg 1990; Koster 1996; Andersson, Madigan and Perlman 1996). Those based on directed graphs include models based on directed cyclic and acyclic graphs, possibly including latent variables and/or selection bias (Pearl, 1988; Spirtes, Glymour and Scheines 1993; Spirtes 1995; Spirtes, Meek, and Richardson 1995; Richardson 1996a, 1996b; Koster 1996; Pearl and Dechter 1996; Cox and Wermuth, 1996).
In section 2 I state two properties, motivated by causal and spatial intuitions, that the set of conditional independencies entailed by a graphical model might satisfy. I proceed to show that the sets of independencies entailed by (i) an undirected graph via separation, and (ii) a (cyclic or acyclic) directed graph (possibly with latent and/or selection variables) via d-separation, satisfy both properties. By contrast neither of these properties, in general, will hold in a chain graph under the Lauritzen-Wermuth-Frydenberg (LWF) interpretation. One property holds for chain graphs under the Andersson-Madigan-Perlman (AMP) interpretation, the other does not. The examination of these properties and others like them may provide insight into the current vigorous debate concerning the applicability of chain graphs under different global Markov properties.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/richardson97a.html

Robust Parameter Learning in Bayesian Networks with Missing Data
Bayesian Belief Networks (BBNs) are a powerful formalism for knowledge representation and reasoning under uncertainty. During the past few years, Artificial Intelligence met Statistics in the quest to develop effective methods to learn BBNs directly from databases. Unfortunately, real-world databases include missing and/or unreported data whose presence challenges traditional learning techniques, from both the theoretical and computational points of view. This paper introduces a new method to learn the probabilities defining a BBN from databases with missing data. The intuition behind this method is close to the robust sensitivity analysis interpretation of probability: the method computes the extreme points of the set of possible distributions consistent with the available information and proceeds by refining this set as more information becomes available.
This paper outlines this method and presents some experimental results comparing this approach to Gibbs sampling.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/ramoni97a.html

Case-based Probability Factoring in Bayesian Belief Networks
Bayesian network inference can be formulated as a combinatorial optimization problem, consisting in the computation of an optimal factoring for the distribution represented in the net. Since the determination of an optimal factoring is a computationally hard problem, heuristic greedy strategies able to find approximations of the optimal factoring are usually adopted. In the present paper we investigate an alternative approach based on a combination of genetic algorithms (GA) and case-based reasoning (CBR). We show how the use of genetic algorithms can improve the quality of the computed factoring in case a static strategy is used (as for the MPE computation), while the combination of GA and CBR can still provide advantages in the case of dynamic strategies. Some preliminary results on different kinds of nets are then reported.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/portinale97a.html

The Effects of Training Set Size on Decision Tree Complexity
This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances.
This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John’s ROBUST-C4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/oates97b.html

A Family of Algorithms for Finding Temporal Structure in Data
Finding patterns in temporally structured data is an important and difficult problem. Examples of temporally structured data include time series of economic indicators, distributed network status reports, and continuous streams such as flight recorder data. We have developed a family of algorithms for finding structure in multivariate, discrete-valued time series data (Oates & Cohen 1996b; Oates, Schmill, & Cohen 1996; Oates et al. 1995). In this paper, we introduce a new member of that family for handling event-based data, and offer an empirical characterization of a time series based algorithm.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/oates97a.html

Combining Neural Network Regression Estimates Using Principal Components
Combining a set of learned models to improve classification and regression estimates has been an area of much research in machine learning and neural networks ([Wolpert92, Merz95, PerroneCooper92, LeblancTibshirani93, Breiman92, Meir95, Krogh95, Tresp95, ChanStolfo95]).
The challenge of this problem is to decide which models to rely on for prediction and how much weight to give each. The goal of combining learned models is to obtain a more accurate prediction than can be obtained from any single source alone. One major issue in combining a set of learned models is redundancy. Redundancy refers to the amount of agreement or linear dependence between models when making a set of predictions. The more the set agrees, the more redundancy is present. In statistical terms, this is referred to as the multicollinearity problem. The focus of this paper is to describe and evaluate an approach for combining regression estimates based on principal components regression. The method, called PCR*, is then evaluated on several real-world domains to demonstrate its robustness versus a collection of existing techniques.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/merz97a.html

An Objective Function for Belief Net Triangulation
This paper presents a new approach to the triangulation of belief networks. Triangulation is a combinatorial optimization problem; our idea is to embed its discrete domain into a continuous domain e. Then, by suitably extending the objective function over e, we can make use of continuous optimization techniques to do the minimization. We used an upper bound of the total junction tree weight as the cost function. The appropriateness of this choice is discussed and explored by simulations.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/meila97a.html

How to Find Big-Oh in Your Data Set (and How Not To)
The \emph{empirical curve bounding problem} is defined as follows. Suppose data vectors $X$, $Y$ are presented such that $E(Y[i]) = \bar{f}(X[i])$ where $\bar{f}(x)$ is an unknown function.
The problem is to analyze $X$, $Y$ and obtain complexity bounds $O(g_u(x))$ and $\Omega(g_l(x))$ on the function $\bar{f}(x)$. As no algorithm for empirical curve bounding can be guaranteed correct, we consider heuristics. Five heuristic algorithms are presented here, together with analytical results guaranteeing correctness for certain families of functions. Experimental evaluations of the correctness and tightness of bounds obtained by the rules for several constructed functions $f(x)$ and real datasets are described.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/mcgeoch97a.html

Preface
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/madigan97a.html

Conceptual Clustering with Numeric-and-Nominal Mixed Data - A New Similarity Based System
This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [13], that gives greater weight to uncommon feature-value matches in similarity computations and makes no assumptions of the underlying distributions of the feature-values, is adopted to define the similarity measure between pairs of objects. An agglomerative algorithm is employed to construct a concept tree, and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other schemes illustrate the superior performance of the algorithm.
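The idea in the abstract above, that a match on an uncommon feature value should count for more than a match on a common one, can be illustrated with a small sketch. This is not Goodall's measure and not the SBAC algorithm: the function name `rarity_weighted_similarity` and the choice of scoring a match by the negative log of the value's relative frequency are assumptions made purely for illustration, and only nominal features are handled.

```python
import math
from collections import Counter


def rarity_weighted_similarity(a, b, data):
    """Toy similarity between two objects with nominal features.

    Hypothetical illustration only (not Goodall's measure): each feature on
    which `a` and `b` agree contributes -log(relative frequency of the shared
    value in `data`), so matches on rarer values contribute more.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    score = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x != y:
            continue  # mismatches contribute nothing in this sketch
        count = Counter(row[j] for row in data)[x]
        if count == 0:
            raise ValueError("matched value does not occur in data")
        score += -math.log(count / n)
    return score


# One nominal feature: "red" is common (3 of 5), "blue" is rarer (2 of 5).
data = [("red",), ("red",), ("red",), ("blue",), ("blue",)]
common = rarity_weighted_similarity(("red",), ("red",), data)
rare = rarity_weighted_similarity(("blue",), ("blue",), data)
mismatch = rarity_weighted_similarity(("red",), ("blue",), data)
```

In this toy data the two "blue" objects score higher than the two "red" objects because "blue" is the less frequent value. A value shared by every object would contribute zero, which is one simple way to express that a universal match carries no information.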
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/li97a.html

Approximate Inference and Forecast Algorithms in Graphical Models for Partially Observed Dynamic Systems
From a statistical point of view, modelling stochastic temporal processes by graphical models is a suitable choice, especially when certain standard assumptions in classical modelling cannot be assumed. Focusing the discussion on partially observed domains, it is important to design algorithms which provide probability distributions over the current and future states of the non-observable components of the domain, using the information stored in the observable components. In this paper, we present a simulation algorithm for approximating the exact probability distributions associated with such inference and forecast processes. This algorithm uses both the probabilities built at the previous time step and the new evidence obtained to propose new probability distributions associated with current and future states of the domain. To validate the algorithm, a case study of equipment maintenance is considered.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/lekuona97a.html

Comparing Predictive Inference Methods for Discrete Domains
Predictive inference is seen here as the process of determining the predictive distribution of a discrete variable, given a data set of training examples and the values for the other problem domain variables. We consider three approaches for computing this predictive distribution, and assume that the joint probability distribution for the variables belongs to a set of distributions determined by a set of parametric models. In the simplest case, the predictive distribution is computed by using the model with the \emph{maximum a posteriori (MAP)} posterior probability.
In the \emph{evidence} approach, the predictive distribution is obtained by averaging over all the individual models in the model family. In the third case, we define the predictive distribution by using Rissanen’s new definition of \emph{stochastic complexity}. Our experiments performed with the family of Naive Bayes models suggest that when using all the data available, the stochastic complexity approach produces the most accurate predictions in the log-score sense. However, when the amount of available training data is decreased, the evidence approach clearly outperforms the two other approaches. The MAP predictive distribution is clearly inferior in the log-score sense to the two more sophisticated approaches, but for the 0/1-score the MAP approach may still in some cases produce the best results.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/kontkanen97a.html

Bayesian Information Retrieval: Preliminary Evaluation
Given a database of documents and a user’s query, how can we locate those documents that meet the user’s information needs? Because there is no precise definition of which documents in the database match the user’s query, uncertainty is inherent in the information retrieval process. Therefore, probability theory is a natural tool for formalizing the retrieval task. In this paper, we propose a Bayesian approach to one of the conventional probabilistic information retrieval models. We discuss the motivation for such a model, describe its implementation, and present some experimental results.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/keim97a.html

Adjusting for Multiple Testing in Decision Tree Pruning
Overfitting is a widely observed pathology of induction algorithms. Overfitted models contain unnecessary structure that reflects nothing more than random variation in the data sample used to construct the model.
Portions of these models are literally wrong, and can mislead users. Overfitted models are less efficient to store and use than their correctly-sized counterparts. Finally, overfitting can reduce the accuracy of induced models on new data [14, 7]. For induction algorithms that build decision trees [1, 13, 15], pruning is a common approach to correct overfitting. Pruning techniques take an induced tree, examine individual subtrees, and remove those subtrees deemed to be unnecessary. While pruning techniques can differ in several respects, they primarily differ in the criterion used to judge subtrees. Many criteria have been proposed, including statistical significance tests [13], corrected error estimates [15], and minimum description length calculations [12]. Most common pruning techniques, however, do not account for one potentially important factor - multiple testing. Multiple testing occurs whenever an induction algorithm examines several candidate models and selects the one that best accords with the data. Any search process necessarily involves multiple testing, and most common induction algorithms involve implicit or explicit search through a space of candidate models. In the case of decision trees, search involves examining many possible subtrees and selecting the best one. Pruning techniques need to account for the number of subtrees examined, because such multiple testing affects the apparent accuracy of models on training data [8]. This paper examines the importance of adjusting for multiple testing. Specifically, it examines the effectiveness of one particular pruning method - \emph{bonferroni pruning}. Bonferroni pruning adjusts the results of a standard significance test to account for the number of subtrees examined at a particular node of a decision tree. Evidence that bonferroni pruning leads to better models supports the hypothesis that multiple testing is an important cause of overfitting. 
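The adjustment described above can be sketched with the standard Bonferroni correction: multiply the p-value of the best candidate by the number of candidates examined, cap it at 1, and compare it with the significance level. The abstract does not give the paper's exact procedure, so the function names, the `alpha` default, and the rule "prune when the adjusted p-value exceeds alpha" are assumptions for illustration.

```python
def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Return the Bonferroni-adjusted p-value, capped at 1.0."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)


def should_prune(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Hypothetical pruning rule: prune the subtree at a node when the best
    split, after adjusting for the n_tests candidates examined there, is not
    significant at level alpha."""
    return bonferroni_adjust(p_value, n_tests) > alpha


# A split with raw p = 0.01 looks significant on its own, but not once
# ten candidate splits have been examined at the same node.
adjusted = bonferroni_adjust(0.01, 10)
prune_after_ten = should_prune(0.01, 10)
prune_after_three = should_prune(0.01, 3)
```

The same raw p-value leads to different decisions depending on how many candidates were searched, which is the multiple-testing effect the abstract argues pruning should account for.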
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/jensen97a.html

A Variational Approach to Bayesian Logistic Regression Models and their Extensions
We consider a logistic regression model with a Gaussian prior distribution over the parameters. We show that accurate variational techniques can be used to obtain a closed form posterior distribution over the parameters given the data, thereby yielding a posterior predictive model. The results are readily extended to (binary) belief networks. For belief networks we also derive closed form posteriors in the presence of missing values. Finally, we show that the dual of the regression problem gives a latent variable density model, the variational formulation of which leads to exactly solvable EM updates.
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/jaakkola97a.html

Robust Interpretation of Neural Network models
Artificial neural networks seem very promising for regression and classification, especially for large covariate spaces. These methods represent a non-linear function as a composition of low dimensional ridge functions and therefore appear to be less sensitive to the dimensionality of the covariate space. However, due to non-uniqueness of a global minimum and the existence of (possibly) many local minima, the model revealed by the network is unstable. We introduce a method to interpret neural network results which uses novel robustification techniques. This results in a robust interpretation of the model employed by the network. Simulated data from known models is used to demonstrate the interpretability results and to demonstrate the effects of different regularization methods on the robustness of the model. Graphical methods are introduced to present the interpretation results. We further demonstrate how interaction between covariates can be revealed.
From this study we conclude that the interpretation method works well, but that NN models may sometimes be misinterpreted, especially if the approximations to the true model are less robust. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/intrator97a.html https://proceedings.mlr.press/r1/intrator97a.html Wavelet based Random Densities In this paper we describe theoretical properties of wavelet-based random densities and give algorithms for their generation. We exhibit random densities subject to some standard constraints: smoothness, modality, and skewness. We also give a relevant example of use of random densities. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/insua97a.html https://proceedings.mlr.press/r1/insua97a.html A Comparison of Scientific and Engineering Criteria for Bayesian Model Selection Given a set of possible model structures for variables $\mathbf{X}$ and a set of possible parameters for each structure, the Bayesian "estimate" of the probability distribution for $\mathbf{X}$ given observed data is obtained by averaging over the possible model structures and their parameters. An often-used approximation for this estimate is obtained by selecting a single model structure and averaging over its parameters. The approximation is useful because it is computationally efficient, and because it provides a model that facilitates understanding of the domain. A common criterion for model selection is the posterior probability of the model. Another criterion for model selection, proposed by San Martini and Spezzaferri (1984), is the predictive performance of a model for the next observation to be seen. From the standpoint of domain understanding, both criteria are useful, because one identifies the model that is most likely, whereas the other identifies the model that is the best predictor of the next observation. 
To highlight the difference, we refer to the posterior-probability and alternative criteria as the \emph{scientific criterion} (SC) and \emph{engineering criterion} (EC), respectively. When we are interested in predicting the next observation, the model-averaged estimate is at least as good as that produced by EC, which itself is at least as good as the estimate produced by SC. We show experimentally that, for Bayesian-network models containing discrete variables only, differences in predictive performance between the model-averaged estimate and EC and between EC and SC can be substantial. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/heckerman97a.html https://proceedings.mlr.press/r1/heckerman97a.html Assessing and Improving Classification Rules The last few years have witnessed a resurgence of research effort aimed at developing improved techniques for supervised classification problems. In large part this resurgence of interest has been stimulated by the novelty of multi-layer feedforward neural networks (Hertz et al., 1991; Ripley, 1996) and similar complex and flexible models such as MARS (Friedman, 1991), projection pursuit regression (Friedman and Stuetzle, 1981), and additive models in general (Hastie and Tibshirani, 1990). The flexibility of these models is in striking contrast to the simplicity of models such as simple linear discriminant analysis, perceptrons, and logistic discriminant analysis, which assume highly restricted forms of decision surface. The merit of the flexibility of neural networks is countered by the dangers that they will overfit the design data. This relationship between model flexibility and the danger of overfitting has long been understood within the statistical community. For example, in the 1960s the optimistic bias of resubstitution error rate became widely recognised and it was replaced by the leave-one-out method as the method of choice. 
(Later, in the 1980s, an apparently large variance of the latter led to its being abandoned in favour of bootstrap methods, in particular the .632 bootstrap.) Early work on neural networks also fell into this trap, producing inflated claims of the performance of such models derived from optimistic performance measures based on overfitting the design set. In recent years the risk has been recognised, and some sophisticated proposals have been made for overcoming the problem. They are based on ideas such as penalising the goodness of fit measure (by combining it with a measure of model complexity), restricting the form of the model (to few nodes in a network, for example), shrinking an overfitted model (by weight decay, for example), or even by adding randomly perturbed replicates to the design set. The problem with all such methods is \emph{how} to strike the optimum compromise between modelling the design data and overfitting. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/hand97a.html https://proceedings.mlr.press/r1/hand97a.html On Predictive Classification of Binary Vectors The problem of rational classification of a database of binary vectors is analyzed by means of a family of Bayesian predictive distributions on the binary hypercube. The general notion of predictive classification was probably first discussed by S. Geisser. The predictive distributions are expressed in terms of a finite number of observables based on a given set of binary vectors (predictors or centroids) representing a system of classes and an entropy-maximizing family of probability distributions. We derive the (non-probabilistic) criterion of maximal predictive classification due to J. Gower (1974) as a special case of a Bayesian predictive classification. The notion of a predictive distribution will be related to stochastic complexity of a set of data with respect to a family of statistical distributions. 
An application to bacterial identification will be presented using a database of Enterobacteriaceae as in Gyllenberg (1996c). A framework for the analysis is provided by a theorem about the merging of opinions due to Blackwell and Dubins (1962). We prove certain results about the asymptotic properties of the predictive learning process. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/gyllenberg97a.html https://proceedings.mlr.press/r1/gyllenberg97a.html Intelligent Assistant for Computational Scientists: Integrated Modelling, Experimentation and Analysis Computing technology has changed the way scientists work. Among the contributions of this new paradigm are the computational sciences, which involve the study of computer simulations rather than physical systems. This transition to a simulated world carries with it an important scientific advantage: the opportunity to run experiments that are expensive, dangerous, or impossible in the real world. Unfortunately, such experiments are often too easy, and the scientist is overwhelmed with empirical data. The fields of Artificial Intelligence (AI) and Statistics are concerned with modelling and analyzing such large bodies of data. AI employs heuristic reasoning and knowledge to select potential models, and statistical analysis verifies a proposed model. The combination of knowledge-based, heuristic, and statistical techniques is quite successful at modelling experiment data (e.g. [11]). Our goal is to provide an intelligent, integrated environment for scientific modelling, experimentation, and analysis, called the Scientist’s Empirical Assistant (SEA). SEA is an assistant to human scientists: it automates model generation and verification, experiment design and data collection, but also relies on a human user for guidance, domain knowledge, and decision-making. 
SEA designs and runs prospective experiments with a simulator, allowing it to draw stronger conclusions than with post-hoc data analysis alone. SEA employs a variety of techniques from both AI and Statistics. It uses heuristic and knowledge-based reasoning to propose models, design experiments, and select analyses. It applies statistical techniques to verify models against experiment data. It develops plans to direct its course of action, and learns which plans are most successful based on past experience. Finally, it models the knowledge of the user to ensure its suggestions and decisions are appropriate. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/gregory97a.html https://proceedings.mlr.press/r1/gregory97a.html Multivariate Density Factorization for Independent Component Analysis: An Unsupervised Artificial Neural Network Approach We propose a novel homogeneous nonlinear self-organising network which employs solely computationally simple Hebbian and anti-Hebbian learning, in approximating a linear independent component analysis (ICA). The learning algorithms diagonalise the transformed data covariance matrix and approximate an orthogonal rotation which maximises the sum of fourth order cumulants. This provides factorisation of the input multivariate density into the individual independent latent marginal densities. We apply this network to linear mixtures of data, which are inherently non-Gaussian and have both Laplacian and bi-modal probability densities. We show that the proposed network is capable of factorising multivariate densities which are linear mixtures of independent latent platykurtic, leptokurtic and uniform distributions. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/girolami97a.html https://proceedings.mlr.press/r1/girolami97a.html Variational Inference for continuous Sigmoidal Bayesian Networks Latent random variables can be useful for modelling covariance relationships between observed variables. 
The choice of whether specific latent variables ought to be continuous or discrete is often an arbitrary one. In a previous paper, I presented a "unit" that could adapt to be continuous or binary, as appropriate for the current problem, and showed how a Markov chain Monte Carlo method could be used for inference and parameter estimation in Bayesian networks of these units. In this paper, I develop a variational inference technique in the hope that it will prove to be more computationally efficient than Monte Carlo methods. After presenting promising \emph{inference} results on a toy problem, I discuss why the variational technique does not work well for \emph{parameter estimation} as compared to Monte Carlo. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/frey97a.html https://proceedings.mlr.press/r1/frey97a.html A Characterization of Bayesian Network Structures and its Application to Learning We present an analysis of the minimal I-map relation between Bayesian network structures and dependency models. This includes a partial order characterisation of the structures, and the connection between the relation and the arc reversal operation. Two applications of this analysis are presented. The first is a simple condition for identifying equivalence between Bayesian network structures, and the second is an exact learning algorithm based on the partial order characterisation. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/forbes97a.html https://proceedings.mlr.press/r1/forbes97a.html Inference using Probabilistic Concept Trees Discussions of ’probabilistic reasoning systems’ often presuppose a belief network, which represents the joint probability distribution of a domain, as the primary knowledge structure. However, another common knowledge structure from which the joint probability distribution can be recovered is a hierarchical probabilistic clustering or probabilistic concept tree (Fisher, 1987). 
Probabilistic concept trees are a target structure for a number of clustering systems from machine learning such as COBWEB (Fisher, 1987) and systems by Hadzikadic and Yun (1989), Gennari, Langley, and Fisher (1989), Decaestecker (1991), Anderson and Matessa (1991), Reich and Fenves (1991), Biswas, Weinberg, and Li (1994), De Alte Da Veiga (1994), Kilander (1994), Ketterlin, Gançarski, and Korczak (1995), and Nevins (1995). Related probabilistic structures are produced by systems such as AUTOCLASS (Cheeseman, Kelly, Self, Stutz, Taylor, & Freeman, 1988), SNOB (Wallace & Boulton, 1968; Wallace & Dowe, 1994), and systems by Hanson and Bauer (1989) and Martin and Billman (1994). These systems can be easily adapted to form probabilistic concept trees of the type we describe. This paper will not focus on clustering systems \emph{per se}, but on characteristics and capabilities of probabilistic concept trees, particularly as they relate to inference tasks often associated with belief networks. As ’object-centered’ knowledge structures, probabilistic concept trees nicely complement the ’variable-centered’, belief network structure. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/fisher97a.html https://proceedings.mlr.press/r1/fisher97a.html Learning Influence Diagram from Data There are many cases where decisions are made (and actions are taken) repeatedly under uncertainty, and consequences (results) of those decisions are available. For example, in the telecommunications industry, decisions are made every day for fraud detection and account treatment. Indicators (variables) that have large uncertainties are used to make these decisions. Furthermore, the consequences of such decisions are recorded for later analysis. Similarly, in the financial industry, stocks or currencies are traded based on some indicators (variables). The consequences of these trades can be found. 
Similarly, in medicine, patient treatment decisions are made on the basis of the patient information, and the consequences of these decisions to the patients can be found. These data sets contain uncertain variables, decision variables, and value lottery (final outcomes). Furthermore, these decisions may be made not by a single decision maker, but by many decision makers. In contrast to a typical decision analysis, in these environments decisions are made repeatedly. This paper addresses the discovery of knowledge bearing on these decisions in the form of influence diagrams (normative decision models) using a novel supervised machine learning method that constructs Bayesian network models with decisions. Algorithms presented in this paper exploit the goal oriented characteristics of influence diagrams and generate a specific form of influence diagrams that are efficient, both to learn and evaluate. For this reason they are called "efficient" influence diagrams. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/ezawa97a.html https://proceedings.mlr.press/r1/ezawa97a.html Inductive Inference of First-Order Models from Numeric-Symbolic Data A factor common to statistical techniques of data analysis is the adopted representation formalism: A tabular (zeroth-order) model with almost exclusively numerical features. On the contrary, several studies on machine learning concern the induction of first-order models from symbolic data, but are inadequate for continuous data. In the paper, we face the problem of handling both numerical and symbolic data in first-order models, distinguishing the moment of model generation from examples (induction) from the moment of model recognition by means of a flexible, probabilistic subsumption test. We demonstrate the proposed solutions on a problem in document understanding where the objective is to induce the models of the logical structure of some real business letters. 
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/esposito97a.html https://proceedings.mlr.press/r1/esposito97a.html Memory Based Stochastic Optimization for Validation and Tuning of Function Approximators This paper focuses on the optimization of hyper-parameters for function approximators. We describe a kind of racing algorithm for continuous optimization problems that spends less time evaluating poor parameter settings and more time honing its estimates in the most promising regions of the parameter space. The algorithm is able to automatically optimize the parameters of a function approximator with less computation time. We demonstrate the algorithm on the problem of finding good parameters for a memory based learner and show the tradeoffs involved in choosing the right amount of computation to spend on each evaluation. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/dubrawski97a.html https://proceedings.mlr.press/r1/dubrawski97a.html Bayesian Model Averaging in Rule Induction Bayesian model averaging (BMA) can be seen as the optimal approach to any induction task. It can reduce error by accounting for model uncertainty in a principled way, and its usefulness in several areas has been empirically verified. However, few attempts to apply it to rule induction have been made. This paper reports a series of experiments designed to test the utility of BMA in this field. BMA is applied to combining multiple rule sets learned from different subsets of the training data, to combining multiple rules covering a test example, to inducing technical rules for foreign exchange trading, and to inducing conjunctive concepts. In the first two cases, BMA is observed to produce lower accuracies than the ad hoc methods it is compared with. 
In the last two cases, BMA is observed to typically produce the same result as simply using the best (maximum-likelihood) rule, even though averaging is performed over all possible rules in the space, the domains are highly noisy, and the samples are medium- to small-sized. In all cases, this is observed to be due to BMA’s consistent tendency to assign highly asymmetric weights to different models, even when their accuracy differs by little, with most models (often all but one) effectively having no influence on the outcome. Thus the effective number of models being averaged is much smaller for BMA than for common ad hoc methods, leading to a smaller reduction in variance. This suggests that the success of the multiple models approach to rule induction is primarily due to this variance reduction, and not to its being a closer approximation to the Bayesian ideal. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/domingos97a.html https://proceedings.mlr.press/r1/domingos97a.html PAC Learning with Constant-Partition Classification Noise and Applications to Decision Tree Induction We consider the problem of concept learning in Valiant’s PAC learning model in which the data used for learning is noisy. Specifically, we introduce a new model of noise called \emph{constant-partition classification noise} (CPCN) which generalizes the standard model of classification noise to allow different examples to have different rates of random misclassification. One example of CPCN type noise is data with differing rates of false positives and false negatives. We then show how to learn in the presence of CPCN for any concept class learnable by statistical queries. This set of classes includes every concept class known to be learnable in the presence of standard classification noise. Our model is the first such non-uniform generalization of the standard classification noise model that allows efficient learning of this wide range of concept classes. 
We then examine standard methods of decision tree induction in the context of noisy data. We observe that the core of commonly used algorithms such as ID3, CART and C4.5 is not robust to CPCN noise, or even to standard classification noise. We therefore propose a simple modification to these algorithms in order to make them robust against CPCN. The modification is based on the statistical query techniques for CPCN described above. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/decatur97a.html https://proceedings.mlr.press/r1/decatur97a.html Dataset Cataloging Metadata for Machine Learning Applications Research As the field of machine learning (ML) matures, two types of data archives are developing: collections of benchmark data sets used to test the performance of new algorithms, and data stores to which machine learning/data mining algorithms are applied to create scientific or commercial applications. At present, the catalogs of these archives are ad hoc and not tailored to machine learning analysis. This paper considers the cataloging metadata required to support these two types of repositories, and discusses the organizational support necessary for archive catalog maintenance. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/cunningham97a.html https://proceedings.mlr.press/r1/cunningham97a.html Using Classification Trees to Improve Causal Inferences in Observational Studies Much of the recent literature on AI and statistics has focused on how to use causal knowledge to enrich the set of valid causal inferences that can be drawn from available data and applied to practical problems such as decision-making, probabilistic diagnosis, and cost-effective control of systems. This paper examines classical problems of valid causal inference in observational studies, using epidemiological studies on the association between exposure to diesel exhaust (DE) and risk of lung cancer as a case study. 
It shows that one of the main applied computational tools of AI and statistics, classification tree analysis, can be adapted to help control or avoid many of the usual statistical threats to valid causal inference, and links this new use of classification trees to an established older literature on techniques for causal inference in social statistics based on elimination of competing (non-causal) explanations for observed associations. A strong link is then forged between an extension of classification tree analysis and modern AI and statistics approaches to causal modeling and inference based on directed acyclic graph (DAG) causal models and influence diagrams. This new link is based on the observation that classification tree analysis can be adapted to test the local Markov conditions that provide the critical defining structure of DAG models, as well as to quantify the conditional distributions of variables given the values of their parents - the key numerical information needed to quantify an influence diagram model. Finally, these insights are applied to available data on DE and lung cancer risks and are used to conclude that there is no evidence of a causal relation between them. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/cox97a.html https://proceedings.mlr.press/r1/cox97a.html Overfitting Explained Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. 
We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/cohen97a.html https://proceedings.mlr.press/r1/cohen97a.html Strategies for Model Mixing in Generalized Linear Models In linear regression models and generalized linear regression models (GLMs), there is often substantial uncertainty about the choice of covariates to include in the model. Both classical and Bayesian approaches that involve selecting a subset of covariates and making inferences conditional on that model choice ignore a major component of uncertainty in the problem. One approach for incorporating this form of model uncertainty into the analysis is by directly building into the model a vector of indicator variables $Y$ that reflects which covariates are included in the model ... Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/clyde97a.html https://proceedings.mlr.press/r1/clyde97a.html A Bayesian approach to CART A Bayesian approach for finding classification and regression tree (CART) models is proposed. By putting an appropriate prior distribution on the space of CART models, the resulting posterior will put higher probability on the more "promising trees". In particular, priors are proposed which penalize complexity by putting higher probability on trees with fewer nodes. Metropolis-Hastings algorithms are used to rapidly grow trees in such a way that the high posterior probability trees are more likely to be obtained. In effect, the algorithm performs a stochastic search for promising trees. Examples are used to illustrate the potential superiority of this approach over conventional greedy methods. 
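To make the idea of a complexity-penalizing tree prior concrete, here is a minimal sketch. The abstract does not give the functional form, so the depth-dependent split probability `alpha * (1 + depth) ** (-beta)` used here is an assumption chosen for illustration, as are the names `split_prob` and `log_tree_prior` and the default values of `alpha` and `beta`.

```python
import math


def split_prob(depth: int, alpha: float = 0.5, beta: float = 1.0) -> float:
    """Assumed prior probability that a node at the given depth is split."""
    return alpha * (1.0 + depth) ** (-beta)


def log_tree_prior(internal_depths, leaf_depths, alpha=0.5, beta=1.0):
    """Log prior of a tree shape under the assumed split-probability model.

    internal_depths: depths of the nodes that were split.
    leaf_depths: depths of the terminal nodes (which were not split).
    """
    log_p = sum(math.log(split_prob(d, alpha, beta)) for d in internal_depths)
    log_p += sum(math.log(1.0 - split_prob(d, alpha, beta)) for d in leaf_depths)
    return log_p


# Three tree shapes of increasing size: a single leaf, one split,
# and a tree whose left child is split again.
stump = log_tree_prior([], [0])
one_split = log_tree_prior([0], [1, 1])
two_splits = log_tree_prior([0, 1], [2, 2, 1])
print(stump, one_split, two_splits)
```

With these illustrative settings the log prior decreases as the tree grows, which is the qualitative behaviour the abstract describes; a stochastic search such as Metropolis-Hastings would combine a prior of this kind with the data likelihood when scoring candidate trees.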
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/chipman97a.html https://proceedings.mlr.press/r1/chipman97a.html An Algorithm for Bayesian Network Construction from Data This paper presents an efficient algorithm for constructing Bayesian belief networks from databases. The algorithm takes a database and an attribute ordering (i.e., the causal attributes of an attribute should appear earlier in the order) as input and constructs a belief network structure as output. The construction process is based on the computation of mutual information and cross entropy of attribute pairs. This algorithm guarantees that the \emph{minimal Independent map} [1] of the underlying dependency model is generated, and at the same time, enjoys the time complexity of $O(N^2)$ on conditional independence (CI) tests. To evaluate this algorithm, we present the experimental results on three versions of the well-known ALARM network database, which has 37 attributes and 10,000 records. The correctness proof and the analysis of computational complexity are also presented. We also discuss the features of our work and relate it to previous works. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/cheng97a.html https://proceedings.mlr.press/r1/cheng97a.html A Forward Monte Carlo Method for Solving Influence Diagrams using local Computation The main goal of this paper is to describe a Monte Carlo method for solving influence diagrams using local computation. The forward Monte Carlo sampling technique draws independent and identically distributed observations. Methods that have been proposed in this spirit sample from the entire distribution. However, when the number of variables is large, the state space of all variables is exponentially large, and the sample size required for good estimates is too large to be practical. In the forward Monte Carlo method we generate observations from a subset of chance variables for each decision node in the influence diagram. 
We use methods developed for exact solution of influence diagrams to limit the number of chance variables sampled at any time. Because influence diagrams model each chance variable with a conditional probability distribution, the forward Monte Carlo solution method lends itself very well to influence-diagram representations. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/charnes97a.html https://proceedings.mlr.press/r1/charnes97a.html Comparing Tree-Simplification Procedures Induced decision trees are frequently used by researchers in the machine learning and statistics communities to solve classification tasks (Breiman et al. 1984; Quinlan 1993). However, their practical utility is limited by difficulties users have in comprehending them due to their size and complexity. Many methods have been proposed to simplify decision trees, but their relative capabilities are largely unknown; their evaluation is usually limited to comparisons with "benchmark" systems (e.g., C4.5, CART). This paper presents a categorization framework for tree-simplification methods and focuses on the empirical comparison of selected methods. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/breslow97a.html https://proceedings.mlr.press/r1/breslow97a.html Using Prediction to Improve Combinatorial Optimization Search This paper describes a statistical approach to improving the performance of stochastic search algorithms for optimization. Given a search algorithm $A$, we learn to predict the outcome of $A$ as a function of state features along a search trajectory. Predictions are made by a function approximator such as global or locally-weighted polynomial regression; training data is collected by Monte-Carlo simulation. Extrapolating from this data produces a new evaluation function which can bias future search trajectories toward better optima. Our implementation of this idea, STAGE, has produced very promising results on two large-scale domains. 
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/boyan97a.html https://proceedings.mlr.press/r1/boyan97a.html Integrating Signal and Language Context to Improve Handwritten Phrase Recognition: Alternative Approaches Handwritten phrase recognition is an important and difficult task. Recent research in this area has focussed on utilising language context to improve recognition performance, without taking the information from the input signal itself into proper account. In this paper, we adopt a Bayesian approach to solving this problem. The Bayesian framework allows us to integrate signal-level information from the actual input with the linguistic context usually used in post-processing the recogniser’s output. We demonstrate the validity of a statistical approach to integrating these two sources of information. We also analyse the need for improvement in performance through innovative estimation of informative priors, and describe our method for obtaining agreement from multiple experts for this task. We compare the performance of our integrated signal-language model against existing "language-only" models. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/bouchaffra97a.html https://proceedings.mlr.press/r1/bouchaffra97a.html A Comparison of Decision Trees, Influence Diagrams and Valuation Networks for Asymmetric Decision Problems We compare three graphical techniques for representation and solution of asymmetric decision problems: decision trees, influence diagrams, and valuation networks. We solve a modified version of Covaliu and Oliver’s Reactor problem using each of the three techniques. For each technique, we highlight the strengths, intrinsic weaknesses, and shortcomings that perhaps can be overcome by further research. 
Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/bielza97b.html https://proceedings.mlr.press/r1/bielza97b.html Markov chain Monte Carlo methods for decision analysis This paper considers an aspect of mixture modelling. Previous studies have shown minimum message length (MML) estimation to perform well in a wide variety of mixture modelling problems, including determining the number of components which best describes some data. In this paper, we focus on the difficult problem of overlapping components. An advantage of the probabilistic mixture modelling approach is its ability to identify models where the components overlap and data items can belong probabilistically to more than one component. Significantly overlapping distributions require more data for their parameters to be accurately estimated than well separated distributions. For example, two Gaussian distributions are considered to significantly overlap when their means are within three standard deviations of each other. If insufficient data is available, only a single component distribution will be estimated, although the data originates from two component distributions. In this paper, we quantify this difficulty in terms of the number of data items needed for the MML criterion to ’discover’ two overlapping components. First, we perform experiments which compare the MML criterion’s performance relative to other Bayesian criteria based on MCMC sampling. Second, we make two alterations to the existing MML estimates in order to improve its performance on overlapping distributions. Experiments are performed with the new estimates to confirm that they are effective. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/bielza97a.html https://proceedings.mlr.press/r1/bielza97a.html Finding Overlapping Distributions with MML This paper considers an aspect of mixture modelling. 
Previous studies have shown minimum message length (MML) estimation to perform well in a wide variety of mixture modelling problems, including determining the number of components which best describes some data. In this paper, we focus on the difficult problem of overlapping components. An advantage of the probabilistic mixture modelling approach is its ability to identify models where the components overlap and data items can belong probabilistically to more than one component. Significantly overlapping distributions require more data for their parameters to be accurately estimated than well separated distributions. For example, two Gaussian distributions are considered to significantly overlap when their means are within three standard deviations of each other. If insufficient data is available, only a single component distribution will be estimated, although the data originates from two component distributions. In this paper, we quantify this difficulty in terms of the number of data items needed for the MML criterion to ‘discover’ two overlapping components. First, we perform experiments which compare the MML criterion’s performance relative to other Bayesian criteria based on MCMC sampling. Second, we make two alterations to the existing MML estimates in order to improve its performance on overlapping distributions. Experiments are performed with the new estimates to confirm that they are effective. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/baxter97a.html https://proceedings.mlr.press/r1/baxter97a.html Building an EDA Assistant: A Progress Report Since 1993 we have been working on a system to help people with exploratory data analysis (EDA). AIDE, an Assistant for Intelligent Data Exploration, is a knowledge-based planning system that incrementally explores a dataset, guided by user directives and its own evaluation of indications in the data.
Its plan library contains strategies for generating and interpreting indications in data, selecting techniques to build appropriate descriptions of data, carrying out relevant procedures, and combining individual results into a coherent larger picture. The system is mixed-initiative, autonomously pursuing high- and low-level goals while still allowing the user to inform or override its decisions. Elsewhere we have described AIDE’s operations and primitive data structures [22], its planning representation [23], its user interface [25, 24], and the system as a whole [21]. This progress report discusses a recent evaluation we conducted with AIDE and explains why we believe that this line of research is important to AI and statistics researchers. We will begin with a very brief overview of the system. The bulk of the paper describes the evaluation, our analysis of the results, and the lessons we learned through the experience of building and evaluating AIDE. We end with a discussion of the generality of our results and the potential for future work. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/amant97a.html https://proceedings.mlr.press/r1/amant97a.html Graphical Model Based Computer Adaptive Testing This paper synthesizes ideas from the fields of graphical modelling and educational testing, particularly Item Response Theory (IRT) applied to Computerized Adaptive Testing (CAT). Graphical modelling can offer IRT a language for describing multifaceted skills and knowledge and disentangling evidence from complex performances. IRT-CAT can offer graphical modellers several ways of treating sources of variability other than including more variables in the model.
In particular, variables can enter into the modelling process at several levels: (1) in validity studies (but not in the ordinarily used model), (2) in task construction (in particular, in defining link parameters), (3) in test or model assembly (blocking and randomization constraints in selecting tasks or other model pieces), (4) in response characterization (i.e., as part of task models which characterize a response) or (5) in the main (student) model. The paper describes an implementation of these ideas in a fielded application: HYDRIVE, a tutor for hydraulics diagnosis. Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/almond97b.html https://proceedings.mlr.press/r1/almond97b.html Intelligent Support of Secondary Data Analysis Sun, 05 Jan 1997 00:00:00 +0000 https://proceedings.mlr.press/r1/almond97a.html https://proceedings.mlr.press/r1/almond97a.html