Proceedings of Machine Learning Research
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics
Held in Fort Lauderdale, FL, USA on 04-07 January 1995
Published as Reissue 0 by the Proceedings of Machine Learning Research on 01 May 2022.
Volume Edited by:
Doug Fisher
Hans-Joachim Lenz
Series Editors:
Neil D. Lawrence
https://proceedings.mlr.press/r0/
Learning from Data by Guiding the Analyst: On the Representation, Use and Creation of Visual Statistical Strategies
The concept of statistical strategy is introduced and used to develop a structured graphical user interface for guiding data analysts so that they can learn about the structure of their data. The interface visually represents statistical strategies that are designed by expert data analysts to guide novices. The representation is an abstraction of the expert’s concepts of the essence of a data analysis. The interface consists of two interacting windows: the guidemap and the workmap. An example is shown in Figure 1 (a screen image from ViSta (Young, 1994), software that implements the ideas in this paper). Each window contains a graph which has nodes and edges. The guidemap graph represents the statistical strategy for a specific statistical task (such as describing data). Nodes represent potential data-analysis actions that can be taken by the system. Edges represent potential actions that can be taken by the analyst. The guidemap graph exists prior to the data-analysis session, having been created by an expert. The workmap graph represents the complete history of all steps taken by the data analyst. It is constructed during the data-analysis session as a result of the analyst’s actions. Workmap nodes represent datasets, data models, or data-analysis procedures which have been created or used by the analyst. Workmap edges represent the chronological sequence of the analyst’s actions. One workmap node is highlighted to show which statistical object is the focus of the strategy.
https://proceedings.mlr.press/r0/young95a.html
Likelihood-based Causal Inference
A method is given which uses subject matter assumptions to discriminate recursive models and thus point toward possible causal explanations. The assumptions alone do not specify any order among the variables - rather just a theoretical absence of direct association. We show how these assumptions, while not specifying any ordering, can, when combined with the data through the likelihood function, yield information about an underlying recursive order. We derive details of the method for multinormal random variables and apply the procedure to a simulated example.
https://proceedings.mlr.press/r0/yao95a.html
Non-Linear Dimensionality Reduction: A Comparative Performance Analysis
We present an analysis of the comparative performance of non-linear dimensionality reduction methods such as Non-Linear Mapping, Non-Metric Multidimensional Scaling and the Kohonen Self-Organising Feature Map, for which data sets of different dimensions are used. To obtain comparative measures of how well the mapping is performed, Procrustes analysis, the Spearman rank correlation coefficient and the scatter-plot diagram are used. Results indicate that, in low dimensions, Non-Linear Mapping has the best performance, especially when measured in terms of the Spearman rank correlation coefficient. The output from the Kohonen Self-Organising Feature Map is easier to interpret than the output from the other methods, as it often provides a superior qualitative visual output. Also, the Kohonen Self-Organising Feature Map may outperform the other methods in a high-dimensional setting.
https://proceedings.mlr.press/r0/vel95a.html
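The Spearman rank correlation between original-space and embedded-space interpoint distances is one concrete way to score such a mapping: a perfectly order-preserving embedding scores 1. A minimal sketch (function names are ours, not the authors' code; no tie handling):

```python
import math
from itertools import combinations

def pairwise_dists(points):
    """Euclidean distances between all pairs of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def rankdata(xs):
    """1-based ranks of the values in xs (ties not handled in this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks

def spearman(x, y):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

Applied to a dimensionality reduction, `spearman(pairwise_dists(high_dim_points), pairwise_dists(embedded_points))` measures how well the embedding preserves the ordering of interpoint distances.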
A Hill-Climbing Approach to Construct Near-Optimal Decision Trees
We consider the problem of identifying the state of an $n$ component coherent system, where each component can be working or failed. It is costly to determine the states of the components. The goal is to find a decision tree which specifies the order of the components to be tested with minimum expected cost. The problem is known to be NP-hard. We present an extremely promising heuristic method for creating effective decision trees, and computational results show that the method obtains optimal solutions for $95\%$ of the cases tested.
https://proceedings.mlr.press/r0/sun95a.html
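The flavor of the problem shows in the special case of a series system (the system works only if every component works), where components are tested in sequence and testing stops at the first failure; a classical result orders tests by the ratio of test cost to failure probability. A sketch under that simplifying assumption (this is an illustration, not the paper's heuristic):

```python
def expected_cost(order, cost, fail_prob):
    """Expected cost of testing components in the given order for a
    series system: component i is tested only if all earlier ones worked."""
    total, p_reach = 0.0, 1.0
    for i in order:
        total += p_reach * cost[i]
        p_reach *= 1.0 - fail_prob[i]   # we continue only if i worked
    return total

def greedy_order(cost, fail_prob):
    """Test cheap, failure-prone components first: sort by c_i / q_i."""
    return sorted(range(len(cost)), key=lambda i: cost[i] / fail_prob[i])
```

For example, with costs `[4, 1, 2]` and failure probabilities `[0.1, 0.5, 0.2]`, the greedy order tests the cheap, unreliable component first and achieves a lower expected cost than testing in index order.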
Data Representations in Learning
This paper examines the effect of varying the coarseness (or fineness) of a data representation upon the achievable learning or recognition accuracy. This accuracy is quantified by the least probability of error in recognition, also known as the Bayes error rate, assuming that there is a finite number of classes into which each data element can be classified. By modeling the granularity variation of the representation as a refinement of the underlying probability structure of the data, we examine how the recognition accuracy varies. Specifically, refining the data representation leads to improved bounds on the probability of error. Indeed, this confirms the intuitive notion that more information can lead to improved decision-making. This analysis may be extended to multiresolution methods where coarse-to-fine and fine-to-coarse variations in representations are possible. Our research was motivated by examining the change in the recognition accuracy of $k$-nearest neighbor classifiers while the resolution of the data - optical character images - is varied. In this domain, the data resolution is crucial in determining trade-offs in the speed and accuracy of the OCR system.
https://proceedings.mlr.press/r0/srikantan95a.html
Representing and Solving Asymmetric Decision Problems Using Valuation Networks
In this paper, we investigate the use of valuation networks to represent and solve asymmetric decision problems. The structural asymmetry information is represented by indicator valuations. An indicator valuation is a special type of probability valuation whose values are restricted to either 0 or 1. Indicator valuations enable us to reduce the domain of probability valuations, and this contributes greatly to improving the computational efficiency of the solution technique. We use indicator valuations to define effective frames as subsets of frames of variables. All numeric information is specified only for effective frames. The solution technique is mostly the same as in the symmetric case. The main difference is that all computations are done on the effective frames of variables, which contributes to the increased efficiency of the solution technique. Also, when restricted to effective frames, the values of indicator valuations are identically one, so indicator valuations can be handled implicitly; this contributes further to the increased efficiency of the solution technique.
https://proceedings.mlr.press/r0/shenoy95a.html
A Generalization of the Tetrad Representation Theorem
The tetrad representation theorem, due to Spirtes, Glymour, and Scheines (1993), gives a graphical condition necessary and sufficient for the vanishing of an individual tetrad difference in a linear structural equation model. In this paper, we generalize their result from individual tetrad differences to sets of tetrad differences of a certain form, and we simplify their proof. The generalization allows tighter constraints to be placed on the set of linear models compatible with given data and thereby facilitates the search for parsimonious models for large data sets.
https://proceedings.mlr.press/r0/shafer95a.html
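For four variables, the tetrad difference is $\tau_{1234} = \sigma_{12}\sigma_{34} - \sigma_{13}\sigma_{24}$, and it vanishes, for instance, under a single-common-factor model where $\sigma_{ij} = \lambda_i \lambda_j$ for $i \neq j$. A small numeric illustration (the loadings below are hypothetical):

```python
def tetrad(cov, i, j, k, l):
    """Tetrad difference sigma_ij * sigma_kl - sigma_ik * sigma_jl."""
    return cov[i][j] * cov[k][l] - cov[i][k] * cov[j][l]

# Covariances implied by one common factor: sigma_ij = lam_i * lam_j (i != j)
lam = [0.9, 0.8, 0.7, 0.6]
cov = [[lam[i] * lam[j] if i != j else 1.0 for j in range(4)]
       for i in range(4)]
```

Under this model every tetrad difference among the four variables is zero, which is the constraint the representation theorem characterizes graphically.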
Combining Statistics and AI in the Optimization of Semiconductors for Solar Cells
In the framework of a research project in photovoltaics, a flexible method of parameter optimization is developed. The target of the optimization is the quality of semiconductor materials for solar cells. The quality depends on the parameter values chosen for the semiconductor production process. The optimization method is based on the combined application of statistics and artificial intelligence. Experiment design is used to collect and analyze experimental data from the process in order to acquire knowledge about the relationship between parameter values and semiconductor quality. Classifiers built by machine learning algorithms help to determine semiconductor quality by the inspection of special signals obtainable from the running process. A final on-line hill-climbing search for optimal parameter values is guided by both the classifier and the knowledge about process behaviour derived from previous experiments.
https://proceedings.mlr.press/r0/risius95a.html
Framework for a Generic Knowledge Discovery Toolkit
Industrial and commercial firms accumulate vast quantities of data in the course of their day-to-day business. The primary use of this data is to monitor business processes: inventory, maintenance actions, and so on. However, this data contains much valuable information that, if accessible, would enhance the understanding of, and aid in improving the performance of, the processes being monitored. Traditional statistical procedures provide some insight into this data, but they are often misused in non-expert hands. With the rapidly increasing quantity of data, it is no longer cost effective for trained statisticians to analyze all the data. The number of variables and observations in these datasets is often very large, and the number of candidate statistical models that might be considered is too large to permit manual systematic exploration. In this type of situation, a Knowledge Discovery (KD) tool is the most effective way to explore the data.
https://proceedings.mlr.press/r0/riddle95a.html
Learning Bayesian Networks Using Feature Selection
This paper introduces a novel enhancement for learning Bayesian networks with a bias for small, high-predictive-accuracy networks. The new approach selects a subset of features which maximizes predictive accuracy prior to the network learning phase. We examine explicitly the effects of two aspects of the algorithm, feature selection and node ordering. Our approach generates networks which are computationally simpler to evaluate and which display predictive accuracy comparable to that of Bayesian networks which model all attributes.
https://proceedings.mlr.press/r0/provan95a.html
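One common filter-style realization of this idea (a generic illustration, not necessarily the scoring used in the paper) ranks attributes by their empirical mutual information with the class label before structure learning begins:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(columns, labels, k):
    """Keep the k attributes whose values carry the most information
    about the class label; columns maps attribute name -> value list."""
    ranked = sorted(columns,
                    key=lambda c: mutual_information(columns[c], labels),
                    reverse=True)
    return ranked[:k]
```

An attribute that merely echoes noise scores near zero and is dropped, so the subsequent network only has to model the informative attributes.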
A Causal Calculus for Statistical Research
Many statisticians are reluctant to deal with problems involving causal considerations because we lack the mathematical notation for distinguishing causal influence from statistical association. To address this problem, a notation is proposed that admits two conditioning operators: ordinary Bayes conditioning, $P(y \mid X=x)$, and causal conditioning, $P(y \mid \operatorname{set}(X=x))$, that is, conditioning $P(y)$ on holding $X$ constant (at $x$) by external intervention. This distinction, which will be supported by three rules of inference, will permit us to derive probability expressions for the combined effect of observations and interventions. The resulting calculus yields simple solutions to a number of interesting problems in causal inference and should allow rank-and-file researchers to tackle practical problems that are generally considered too hard or impossible. Examples are:
1. Deciding whether the information available in a given observational study is sufficient for obtaining consistent estimates of causal effects.
2. Deriving algebraic expressions for causal effect estimands.
3. Selecting measurements that would render randomized experiments unnecessary.
4. Selecting a set of indirect (randomized) experiments to replace direct experiments that are either infeasible or too expensive.
5. Predicting (or bounding) the efficacy of treatments from randomized trials with imperfect compliance.
Starting with a nonparametric specification of structural equations, the paper establishes the semantics necessary for a theory of interventions, presents the three rules of inference, demonstrates the use of the resulting calculus on a number of examples, and establishes an operational definition of structural equations.
https://proceedings.mlr.press/r0/pearl95a.html
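In the simplest fully observed case, a single confounder $Z$, the calculus reduces $P(y \mid \operatorname{set}(X=x))$ to the adjustment formula $\sum_z P(y \mid x, z) P(z)$, which generally differs from ordinary conditioning. A toy numeric sketch (all probability tables below are hypothetical, chosen only to make the gap visible):

```python
def p_y_do_x(p_z, p_y_given_xz, x):
    """Causal conditioning via adjustment: sum_z P(y|x,z) * P(z)."""
    return sum(p_z[z] * p_y_given_xz[(x, z)] for z in p_z)

def p_y_obs_x(p_z, p_x_given_z, p_y_given_xz, x):
    """Ordinary Bayes conditioning: weight z by P(z|x) instead of P(z)."""
    joint = {z: p_z[z] * p_x_given_z[(x, z)] for z in p_z}
    norm = sum(joint.values())
    return sum((joint[z] / norm) * p_y_given_xz[(x, z)] for z in p_z)

p_z = {0: 0.5, 1: 0.5}                                   # confounder prior
p_x_given_z = {(1, 0): 0.8, (1, 1): 0.2,
               (0, 0): 0.2, (0, 1): 0.8}                 # P(X=x | z)
p_y_given_xz = {(1, 0): 0.9, (1, 1): 0.3,
                (0, 0): 0.7, (0, 1): 0.1}                # P(Y=1 | x, z)
```

Here $P(Y{=}1 \mid \operatorname{set}(X{=}1)) = 0.6$ while $P(Y{=}1 \mid X{=}1) = 0.78$: observing $X{=}1$ also carries information about $Z$, which an intervention does not.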
Searching for Dependencies in Bayesian Classifiers
In this paper, we explore an alternate approach to determining whether it is useful to join two attributes when constructing a Bayesian classifier. We also give experimental results on parity functions, an artificial set of functions that are particularly difficult for naive Bayesian classifiers, and results on three naturally occurring data sets.
https://proceedings.mlr.press/r0/pazzani95a.html
Detecting Complex Dependencies in Categorical Data
Locating and evaluating relationships among values in multiple streams of data is a difficult and important task. Consider the data flowing from monitors in an intensive care unit. Readings from various subsets of the monitors are indicative and predictive of certain aspects of the patient’s state. We present an algorithm, called Multi-Stream Dependency Detection (MSDD), that facilitates discovery and assessment of the strength of such predictive relationships. We use heuristic search to guide our exploration of the space of potentially interesting dependencies to uncover those that are significant. We begin by reviewing the dependency detection technique described in [3], and extend it to the multiple stream case, describing in detail our heuristic search over the space of possible dependencies. Quantitative evidence for the utility of our approach is provided through a series of experiments with artificially-generated data. In addition, we present results from the application of our algorithm to two real problem domains: feature-based classification and prediction of pathologies in a simulated shipping network.
https://proceedings.mlr.press/r0/oates95a.html
Part-of-Speech Tagging from "Small" Data Sets
Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model method on a 900,000 token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000 token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues this may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.
https://proceedings.mlr.press/r0/neufeld95a.html
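An ending-based approach handles unseen words by backing off through word suffixes observed in training. A minimal sketch of the idea (illustrative only, not the authors' exact method; the tags and the `NN` default are assumptions):

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=3):
    """Count tag frequencies for word endings up to max_len characters."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(max_len, len(word)) + 1):
            counts[word[-k:]][tag] += 1
    return counts

def guess_tag(word, counts, default="NN", max_len=3):
    """Back off from the longest known suffix to shorter ones;
    fall back to a default tag if no suffix was ever seen."""
    for k in range(min(max_len, len(word)), 0, -1):
        suffix = word[-k:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return default
```

Trained on even a small tagged sample, the model tags an unseen word like "jumping" from its "-ing" ending, which is exactly the information a whole-word lexicon misses.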
Statistical Preprocessing for Decision Tree Induction
Some apparently simple numeric data sets cause significant problems for existing decision tree induction algorithms, in that no method is able to find a small, accurate tree, even though one exists. One source of this difficulty is the goodness measures used to decide whether a particular node represents a good way to split the data. This paper points out that the commonly-used goodness measures are not equipped to take into account some patterns in numeric attribute spaces, and presents a framework for capturing some such patterns into decision tree induction. As a case study, it is demonstrated empirically that supervised clustering, when used as a preprocessing step, can improve the quality of both univariate and multivariate decision trees.
https://proceedings.mlr.press/r0/murthy95a.html
Two Applications of Statistical Modelling to Natural Language Processing
Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. It is desired to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of this project requiring advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments to compare the performance of the computer diagnostic program to that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined for which estimated variances and covariances are easily computed. This allows statistical conclusions about the similarities and dissimilarities among diagnoses by the various programs and experts.
https://proceedings.mlr.press/r0/mouchel95a.html
Modeling Life Time Data by Neural Networks
With the advancement of sophisticated computer programs, much of the data analysis process, such as graph drawing, hypothesis testing, and construction of interval estimates, can be automated. One exception is the process of interpreting graphical data, which is still done by trained statisticians. Efforts to computerize the interpretation of graphical data must address at least two issues. First, we need to incorporate the flexibility of trained statisticians. Second, we need to incorporate the desirable subjectivity of experienced statisticians. This paper presents a method which automates the process of graphical analysis using neural networks trained by the back-propagation learning rule. Two case studies were performed to demonstrate the feasibility of the method. In particular, the empirical case study demonstrated the effectiveness of the neural network approach.
https://proceedings.mlr.press/r0/moon95a.html
Dynamical Learning Bias Selection
Determining the conditions for which a given learning algorithm is appropriate is an open problem in machine learning. Methods for selecting a learning algorithm for a given domain or for a portion of the domain have met with limited success. This paper proposes a new approach to predicting a given example’s class by locating it in the "example space" and then choosing the best learner(s) in that region of the example space to make predictions. The regions of the example space are defined by the prediction patterns of the learners being used. The learner(s) chosen for prediction are selected according to their past performance in that region. This dynamic approach to learning bias selection is compared to other methods for selecting from multiple learning algorithms.
https://proceedings.mlr.press/r0/merz95a.html
An Exact Probability Metric for Decision Tree Splitting and Stopping
ID3’s information gain heuristic [16] is well known to be biased towards multi-valued attributes. This bias is only partially compensated by the gain ratio used in C4.5 [20]. Several alternatives have been proposed, notably orthogonality [9] and Beta [5]. Gain ratio and orthogonality are strongly correlated, and all of the metrics share a common bias towards splits with one or more small expected values, under circumstances where the split likely occurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the posterior probability of the null hypothesis. Both gain and the chi-squared significance test are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules in CART [3, pp 59-66] and ID3 [20, pp 36-37] are traced to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. Empirical results show that pre-pruning should be done, as trees pruned in this way are simpler, more efficient, and no less accurate than unpruned trees. Average training time is reduced by up to $30\%$, and expensive post-pruning is avoided.
https://proceedings.mlr.press/r0/martin95b.html
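The exact probability of a split's class-by-branch contingency table under fixed margins is the multiple hypergeometric, $P = \prod_i R_i! \prod_j C_j! \,/\, (N! \prod_{ij} n_{ij}!)$ for row totals $R_i$, column totals $C_j$, and cell counts $n_{ij}$. A direct sketch of that formula (naming is ours, not the paper's):

```python
from math import factorial, prod

def table_probability(table):
    """Multiple hypergeometric probability of a contingency table,
    conditioning on its row and column totals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    num = prod(factorial(x) for x in rows) * prod(factorial(x) for x in cols)
    den = factorial(n) * prod(factorial(x) for r in table for x in r)
    return num / den
```

For 2x2 tables this reduces to Fisher's exact single-table probability; summing it over the tables at least as extreme as the observed one gives the exact tail probability that an asymptotic chi-squared test only approximates.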
Evaluating and Comparing Classifiers: Complexity Measures
Relevant literature on Kolmogorov complexity measures and on trade-offs of classifier accuracy for reduced complexity is reviewed, seeking a pragmatic methodology for the practising applications analyst. Significant findings are that: (1) an accuracy/complexity trade-off is desirable; (2) combined measures of accuracy/complexity are not practical, due to difficulties encoding constraint satisfaction, the lack of sampling statistics and suitable tests of the null hypothesis, and practical difficulties of encoding complex functions and encoding across families of classifiers; (3) therefore, a generalized version of the CART [5] 1-SE rule is recommended; (4) Kolmogorov complexity is not practically computable (see (2)); and, therefore, (5) simply measuring response times on a target environment is the recommended measure of complexity.
https://proceedings.mlr.press/r0/martin95a.html
MENTOR: A Bayesian Model for Prediction and Intervention in Mental Retardation
Mental Retardation (MR), or mental deficiency, is a complex medical and social problem. The prevalence is estimated to be about 2.5 per cent of the population [Bats93], [StSu92]. Various studies have reported somewhat different figures (between 2% and 5%) depending on the definition of MR adopted and the age group surveyed [StSu92]. It is a developmental disability with a complex etiology. The causative factors and mechanisms are not well understood. According to the American Association on Mental Retardation (AAMR), "Mental Retardation is characterized by significantly subaverage intellectual functioning" [AAMR92, page 5]. The AAMR has recommended that people scoring below two standard deviations (SD) in a standardized IQ test be classified as retarded [AAMR92, page 5]. These tests are normalized to a mean of 100 with a SD of 15. Those with scores below 50 are considered severely retarded. Scores in the 50-69 range fall in the classification of Mild Mental Retardation (MMR). Though the AAMR suggests inclusion of limitation of adaptive skills as well [AAMR92, page 6], many studies have used cognitive tests (IQ scores) for classification [StSu92], [McDe93]. A category called Borderline Mental Retardation (BMR) - scores falling between one and two standard deviations - was in vogue previously, but due to the social stigma attached to MR and concerns about test errors, it was de-emphasized subsequently. We shall go by IQ scores and keep the category of BMR for understanding causal mechanisms. For severe MR a cause can be found in the majority of cases. In MMR, which forms 85% of MR, a cause cannot be determined in half the cases [Bats93].
So here we have a complex web of unknown causal mechanisms, disagreement among experts, controversies (the large literature on nature versus nurture), and serious gaps in the experts’ understanding of the etiological factors.
https://proceedings.mlr.press/r0/mani95a.html
An Axiomatization of Loglinear Models with an Application to the Model-Search Problem
A good strategy to save computational time in a model-search problem consists in endowing the search procedure with a mechanism of logical inference, which sometimes allows an interaction model to be accepted or rejected without resorting to the numeric test. In principle, the best inferential mechanism should be based on a sound and complete axiomatization of interaction models. We present a sound (and probably incomplete) axiomatization which can be translated into a graphical inference procedure working with directed acyclic graphs.
https://proceedings.mlr.press/r0/malvestuto95a.html
On Test Selection Strategies for Belief Networks
Decision making under uncertainty typically requires an iterative process of information acquisition. At each stage, the decision maker chooses the next best test (or tests) to perform, and re-evaluates the possible decisions. Value-of-information analyses provide a formal strategy for selecting the next test(s). However, the complete decision-theoretic approach is impractical, and researchers have sought approximations. In this paper, we present strategies for both myopic and limited non-myopic (working with known test groups) test selection in the context of belief networks. We focus primarily on utility-free test selection strategies. However, the methods have immediate application to the decision-theoretic framework.
https://proceedings.mlr.press/r0/madigan95a.html
Tree Structured Interpretable Regression
We describe a new method of regression, closely related to the regression ideas of CART, which has the following potential advantages over traditional methods: the method can naturally be applied to very large datasets in which only a small proportion of the predictors are useful; the resulting regression rules are more easily interpreted and applied; and they may be more accurate in application, since the rules are derived by means of a cross-validation technique which maximizes their predictive accuracy. The system is evaluated in an empirical study and compared to traditional regression and CART systems.
https://proceedings.mlr.press/r0/lubinsky95a.html
Propagation of Gaussian Belief Functions
Gaussian belief functions are represented in both variable space and configuration space. Their combinations are defined in terms of Dempster’s rule, sweep operators, and restrictions in configuration space. The equivalence of the alternative definitions is proved. The computation of Gaussian belief functions is shown to follow the Shafer-Shenoy axioms.
https://proceedings.mlr.press/r0/liu95a.html
On Graphical Models for Dynamic Systems
It is widely recognized that probabilistic graphical models provide a good framework for both knowledge representation and probabilistic inference (e.g., see [2], [14]). The dynamic behaviour of a system which changes over time needs an implicit or explicit time representation. In this paper, an implicit time representation using dynamic graphical models is proposed. Our goal is to model the state of a system and its evolution over time in a richer and more natural way than other approaches, together with a more suitable treatment of the inference on variables of interest.
https://proceedings.mlr.press/r0/lekuona95a.html
Structure Learning of Bayesian Networks by Hybrid Genetic Algorithms
This paper demonstrates how Genetic Algorithms can be used to discover the structure of a Bayesian Network from a given database of cases. The results presented were obtained by applying four different types of Genetic Algorithms - SSGA (Steady State Genetic Algorithm), GAe$\lambda$ (Genetic Algorithm elitist of degree $\lambda$), hSSGA (hybrid Steady State Genetic Algorithm) and hGAe$\lambda$ (hybrid Genetic Algorithm elitist of degree $\lambda$) - to simulations of the ALARM Network. The behaviour of these algorithms is studied with respect to their parameters.
https://proceedings.mlr.press/r0/larranaga95a.html
Hierarchical Clustering of Composite Objects with a Variable Number of Components
This paper examines the problem of clustering a sequence of objects that cannot be described with a predefined list of attributes (or variables). In many applications, such a crisp representation cannot be determined. An extension of the traditional propositional formalism is thus proposed, which allows objects to be represented as a set of components. The algorithm used for clustering is briefly illustrated, and mechanisms to handle sets are described. Some empirical evaluations are also provided to assess the validity of the approach.
https://proceedings.mlr.press/r0/ketterlin95a.html
Viewpoint-Based Measurement of Semantic Similarity between Words
A method of measuring semantic similarity between words using a knowledge base constructed automatically from machine-readable dictionaries is proposed. The method takes into consideration the fact that similarity changes depending on situation or context, which we call ’viewpoint’. A feature of the method is that certain parts of the overall concept of the words being compared are emphasized by using the viewpoint when calculating the degree of similarity. Evaluation shows that the proposed method, although based on a simply structured knowledge base, is superior to other currently available methods.
https://proceedings.mlr.press/r0/kasahara95a.html
Robust Linear Discriminant Trees
We present a new method for the induction of classification trees with linear discriminants as the partitioning function at each internal node. This paper presents two main contributions: first, a novel objective function called soft entropy, which is used to identify optimal coefficients for the linear discriminants, and second, a novel method for removing outliers called iterative re-filtering, which boosts performance on many datasets. These two ideas are presented in the context of a single learning algorithm called DT-SEPIR.
https://proceedings.mlr.press/r0/john95a.html
Solving Influence Diagrams Using Gibbs Sampling
We describe a Monte Carlo method for solving influence diagrams. This method is a combination of stochastic dynamic programming and Gibbs sampling, an iterative Markov chain Monte Carlo algorithm. Our method is especially useful when exact methods for solving influence diagrams fail.
https://proceedings.mlr.press/r0/jenzarli95a.html

Detecting Imperfect Patterns in Event Streams Using Local Search
https://proceedings.mlr.press/r0/howe95a.html
The problem of inferring causality from empirical observations has been well studied. Several approaches are notable for efficiently constructing complex causal models of the inter-relationships between variables (e.g., [8, 3, 2]). These approaches tend to rely on correlations and co-variances among the variables as the basis for inferring causality. However, for some applications the available data are categorical observations over time (e.g., event streams or execution traces of programs); for example, patterns in execution traces form the basis of several methods of debugging software (e.g., [1, 4]). These applications are less amenable to solution by methods based on correlation and co-variance. An alternative approach, called Dependency Detection, searches the event streams for frequently recurring sequences [6]. The set of recurring sequences (called dependencies) indicates events that commonly co-occur and forms a weak model of causality. Although promising, dependency detection is limited in several ways. First, the underlying search is exhaustive, looking for all possible dependencies in the data; as a consequence, the computational complexity increases exponentially with the length of the sequences. Second, the sequences are rigid, which means that only exact matches count and that any noise (i.e., the insertion of some unrelated event into the stream) will not count as an example of the sequence. Third, the technique considers only a single stream rather than multiple streams of parallel events; Oates et al. have developed a technique for multi-stream dependency detection [7]. This paper describes how local search with flexible matching has been used to overcome the first two limitations.
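
The flexible-matching idea can be illustrated with a small sketch (not Howe's actual algorithm; the event alphabet, window size, and matching rule are assumptions): a target sequence counts as present whenever its events occur in order within a bounded window, with unrelated noise events tolerated in between.

```python
def count_flexible_matches(stream, pattern, max_window):
    """Count positions where `pattern` occurs as an ordered subsequence
    of `stream`, starting at that position and fitting within `max_window`
    consecutive events. Interleaved noise events are tolerated."""
    n, hits = len(stream), 0
    for start in range(n):
        if stream[start] != pattern[0]:
            continue
        j = 1  # index of the next pattern element to match
        for k in range(start + 1, min(start + max_window, n)):
            if j < len(pattern) and stream[k] == pattern[j]:
                j += 1
        if j == len(pattern):
            hits += 1
    return hits

events = list("abxcabcaxbc")
print(count_flexible_matches(events, list("abc"), max_window=4))  # → 3
```

A rigid (exact-match) detector would miss the occurrences interrupted by the noise event `x`; with a window of 4 all three occurrences of `a…b…c` are found.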

A Decision-Based View of Causality
https://proceedings.mlr.press/r0/heckerman95a.html
We present a precise definition of cause and effect in terms of a more fundamental notion called unresponsiveness. Our definition departs from the traditional view of causation in that our causal assertions are made relative to a set of decisions. An important consequence of this departure is that we can reason about cause locally, not necessarily attaching a causal explanation to every dependency. Such local reasoning can be beneficial in that, given a set of real decisions to make, it may not be necessary to determine whether some dependencies are causal. Also in this paper, we examine the graphical encoding of causal relationships. We show that ordinary influence diagrams are an inadequate representation of cause, whereas influence diagrams in Howard Canonical Form can always represent cause and effect accurately. In addition, we establish a correspondence between Pearl and Verma’s (1991) causal model and the influence diagram.

Omega-Stat: An Environment for Implementing Intelligent Modeling Strategies
https://proceedings.mlr.press/r0/harner95a.html
Omega-Stat is a new statistical environment built on Lisp-Stat, an object-oriented statistical programming environment. It contains extensible, reusable-component libraries for performing data management, multivariate analyses, modeling, and dynamic graphics. The point-and-click user interface allows instant access to all objects, including analysis and graphics objects comprising a semantic map. Knowledge, and methods for accessing this knowledge, are embedded within model objects and the edge objects linking these models. This will allow the modeling process to be studied by following the analysis trails of expert analysts. The objective is to provide an "expert consultant" that is accessible as part of the man/machine interaction. Modeling strategies can then be built into Omega-Stat by using prior knowledge and data-analytic heuristics to guide the process of constructing the model tree and the iterative search for an "optimal" model.

Abductive Reasoning in Bayesian Belief Networks Using a Genetic Algorithm
https://proceedings.mlr.press/r0/gelsema95a.html
Bayesian belief networks (causal networks) have been extensively studied in the past ten years. It has been shown that they provide a sound formalism for probabilistic reasoning, especially if uncertainty is to be represented. A probability space can be modelled as a Bayesian belief network of propositional variables (nodes) which may be pairwise connected by directed arcs. The interpretation is that if an arc exists from node $A$ to node $B$, the probability of node $B$ assuming a given state $b_{i}$ depends on the actual state of node $A$ ($A$ is a direct cause of $B$). The absence of an arc between two nodes implies that there is no such direct dependence. Thus, in a Bayesian belief network, probabilistic dependencies are modelled as arcs between nodes, while independencies are implied by the absence of arcs. If, for a given probability space, the prior probabilities of all states of the root nodes are known and, in addition, the conditional probabilities of all non-root nodes given their parent states, then the joint probability distribution is completely known. Textbooks on Bayesian belief networks are [1] and [2].
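
The factorization described in the abstract — the joint distribution is the product of the root priors and the non-root conditionals — can be sketched for a minimal two-node network $A \to B$ (the probability values below are made up for illustration):

```python
def joint_prob(p_a, p_b_given_a):
    """Joint distribution P(A, B) for a two-node network A -> B, built from
    the prior on the root A and the conditional table for B given A."""
    return {(a, b): p_a[a] * p_b_given_a[a][b]
            for a in p_a for b in p_b_given_a[a]}

# Hypothetical numbers for illustration only.
p_a = {"T": 0.3, "F": 0.7}                 # prior on root node A
p_b = {"T": {"T": 0.9, "F": 0.1},          # P(B | A = T)
       "F": {"T": 0.2, "F": 0.8}}          # P(B | A = F)
joint = joint_prob(p_a, p_b)               # e.g. P(A=T, B=T) = 0.3 * 0.9 = 0.27
```

The same product form extends node by node to larger networks, which is what makes the joint distribution "completely known" once priors and conditionals are specified.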

Learning Possibilistic Networks from Data
https://proceedings.mlr.press/r0/gebhardt95a.html
We introduce a method for inducing the structure of (causal) possibilistic networks from databases of sample cases. In comparison to the construction of Bayesian belief networks, the proposed framework has some advantages, namely the explicit consideration of imprecise (set-valued) data, and the realization of a controlled form of information compression in order to increase the efficiency of the learning strategy as well as of approximate reasoning using local propagation techniques. Our learning method has been applied to reconstruct a non-singly connected network of 22 nodes and 22 arcs without the need for any a priori supplied node ordering.

Classifying New Words for Robust Parsing
https://proceedings.mlr.press/r0/franz95a.html
Robust natural language parsing systems must be able to handle words that are not in their lexicons. This paper describes a statistical classifier that determines the most likely parts of speech of new words. The classifier uses a loglinear model to obtain smoothed conditional probabilities that take into account the interactions between different features. We show accuracy results for this model, and compare it to some simpler methods.

Index
https://proceedings.mlr.press/r0/fisher95b.html

Frontmatter
https://proceedings.mlr.press/r0/fisher95a.html

Which Method Learns Most from the Data? Methodological Issues in the Analysis of Comparative Studies
https://proceedings.mlr.press/r0/feelders95a.html
The mutual discovery of the statistical and artificial intelligence communities (see e.g. [Han93, CO94]) has resulted in many studies which compare the performance of statistical and machine learning methods on empirical data sets; examples are the StatLog project ([MST94]) and the Santa Fe Time Series Competition ([WG94]), as well as numerous journal articles ([KWR93, RABCK93, WHR90, TAF91, TK92, FG93]). What has struck us is the casual manner in which comparisons are typically carried out in the literature. The ranking of $k$ preselected methods is performed by training them (estimating, in statistical terminology) on a single data set and estimating their respective mean prediction errors (MPE) from a hold-out sample. The methods are subsequently ranked according to their estimated MPEs. When the total number of observations is small, cross-validation rather than a hold-out sample is usually used to estimate the mean prediction errors. A more rigorous comparison of methods should include significance testing rather than a mere ranking based on the estimated MPEs. The statistical analysis of comparative studies, method ranking in particular, is addressed in this paper. Specifically, we address methodological issues of studies in which the performance of several regression or classification methods is compared on empirical data sets.
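
The kind of significance testing advocated above can be sketched as a paired t statistic on per-fold error differences between two methods (the per-fold error rates below are hypothetical, and this is only one of several tests one might use):

```python
import math
import statistics

def paired_t(errors_a, errors_b):
    """Paired t statistic on per-fold error differences (H0: equal mean
    prediction error). Compare |t| against t(n-1) critical values."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)            # sample standard deviation
    return mean / (sd / math.sqrt(len(diffs)))

# Hypothetical per-fold hold-out error rates for two methods:
errs_a = [0.21, 0.25, 0.19, 0.23, 0.22]
errs_b = [0.20, 0.24, 0.21, 0.22, 0.21]
t_stat = paired_t(errs_a, errs_b)           # small |t|: no evidence of a real difference
```

A mere ranking would declare method B the winner here; the paired test makes explicit that the observed difference is well within sampling noise.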

A Further Comparison of Pruning Methods in Decision-Tree Induction
https://proceedings.mlr.press/r0/esposito95a.html

Heuristic Search for Model Structure
https://proceedings.mlr.press/r0/elder95a.html

Learning in Hybrid Noise Environments Using Statistical Queries
https://proceedings.mlr.press/r0/decatur95a.html
We consider theoretical models of learning from noisy data. Specifically, we focus on learning in the probably approximately correct model as defined by Valiant. Two of the most widely studied models of noise in this setting have been classification noise and malicious errors. However, a more realistic model combining the two types of noise has not been formalized. We define a learning environment based on a natural combination of these two noise models. We first show that hypothesis testing is possible in this model. We next describe a simple technique for learning in this model, and then describe a more powerful technique based on statistical query learning. We show that the noise tolerance of this improved technique is roughly optimal with respect to the tolerance of the statistical query algorithm and that it provides a smooth tradeoff between the tolerable amounts of the two types of noise. Finally, we show that statistical query simulation yields learning algorithms for other combinations of noise models, thus demonstrating that statistical query specification truly captures the generic fault tolerance of a learning algorithm.

Textual Data Mining
https://proceedings.mlr.press/r0/cunningham95a.html
Most automated or semi-automated techniques for extracting novel information from data have concentrated on analyzing simple tables of numeric or atomic symbolic values. A related (but much more complex) problem, that of inferring new facts or knowledge from textual databases, has been addressed most effectively by the library and information retrieval research communities. This paper incorporates several ad hoc search strategies proposed by those communities into a single search methodology that guides the search process and provides a framework for the presentation of facts gleaned from the search. This graphical search result representation is semi-formal, in the sense that it represents the structure of search results formally while the contents of the search are represented informally. The methodology is intended as an aid to "mining" new scientific information from textual/bibliographic databases, rather than as an automated proof system.

Truncated Gaussians as Tolerance Sets
https://proceedings.mlr.press/r0/cozman95a.html
This work presents a new class of statistical models that are well suited for several Robotics applications, such as object recognition or computer vision. Our approach deals with bounded data: measurements that are constrained to appear in a bounded region in the measurement space. The literature refers to the set where the measurement can appear as the tolerance set for the measurement. To date, few statistical models for bounded variables are used in Artificial Intelligence. The most common model is the uniform distribution, but this approach has several drawbacks: summation of uniform variables does not yield a uniform variable, and application of Bayes rule is computationally intensive [9]. Another approach is to use the Gaussian distribution and model bounds through an ad hoc selection mechanism [1, 2]. In short, even though bounds contain significant information, they have not yet received proper attention.
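
A Gaussian restricted to a tolerance set can be sketched with generic rejection sampling (a standard technique, not necessarily the construction used in the paper; the bounds and parameters below are assumptions):

```python
import random

def sample_truncated_gaussian(mu, sigma, lo, hi, rng=random.Random(0)):
    """Draw from a Gaussian restricted to the tolerance set [lo, hi]
    by rejection sampling (adequate when the set carries non-trivial mass)."""
    while True:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x

# A measurement known to lie in [-0.5, 0.5]:
samples = [sample_truncated_gaussian(0.0, 1.0, -0.5, 0.5) for _ in range(1000)]
```

Unlike the uniform model, the draws concentrate near the mode while still respecting the hard bounds; rejection sampling becomes wasteful only when the tolerance set sits far in the tails.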

Using Causal Knowledge to Learn More Useful Decision Rules From Data
https://proceedings.mlr.press/r0/cox95a.html
One of the most popular and enduring paradigms in the intersection of machine-learning and computational statistics is the use of recursive-partitioning or "tree-structured" methods to "learn" classification trees from data sets [Buntine, 1993; Quinlan, 1986]. This approach applies to independent variables of all scale types (binary, categorical, ordered categorical, and continuous) and to noisy as well as to noiseless training sets. It produces classification trees that can readily be reexpressed as sets of expert systems rules (with each conjunction of literals corresponding to a set of values for variables along one branch through the tree). Each such rule produces a probability vector for the possible classes (or dependent variable values) that the object being classified may have, thus automatically presenting confidence and uncertainty information about its conclusions. Classification trees can be validated by methods such as cross-validation (Breiman et al., 1984), and they can easily be modified to handle missing data by constructing rules that exploit only the information contained in the observed variables.

Causal Discovery from Data in the Presence of Selection Bias
https://proceedings.mlr.press/r0/cooper95a.html
Recent research advances have made it possible to consider using observational data to infer causal relationships among measured variables. Selection bias results from the observation of entities that are not representative of the entities that are generated by a causal process of interest. This paper shows that we can sometimes detect the presence of selection bias in observational data. The paper also demonstrates how selection bias can hinder the discovery of causal relationships from observational data. As we will describe, the use of experimental data (e.g., data from randomized, controlled trials) to discover causal relationships can be susceptible as well to problems involving selection bias. We offer suggestions for how to proceed with causal discovery in the face of selection bias.

Two Algorithms for Inducing Structural Equation Models from Data
https://proceedings.mlr.press/r0/cohen95a.html
We present two algorithms for inducing structural equation models from data. Assuming no latent variables, these models have a causal interpretation and their parameters may be estimated by linear multiple regression. Our algorithms are comparable with PC [15] and IC [12, 11], which rely on conditional independence. We present the algorithms and empirical comparisons with PC and IC.

Learning Bayesian Networks: Search Methods and Experimental Results
https://proceedings.mlr.press/r0/chickering95a.html
We discuss Bayesian approaches for learning Bayesian networks from data. First, we review a metric for computing the relative posterior probability of a network structure given data, developed by Heckerman et al. (1994a, b, c). We see that the metric has a property useful for inferring causation from data. Next, we describe search methods for identifying network structures with high posterior probabilities. We describe polynomial algorithms for finding the highest-scoring network structures in the special case where every node has at most $k=1$ parent. We show that the general case ($k>1$) is NP-hard, and review heuristic search algorithms for this general case. Finally, we describe a methodology for evaluating learning algorithms, and use this methodology to evaluate various scoring metrics and search procedures.

Analysis and Application of the Generalized Mean-Shift Process
https://proceedings.mlr.press/r0/cheng95a.html
The mean shift process repeatedly moves each data point to the mean of the data points in its neighborhood. This process is generalized and analyzed. Its relation with maximum-entropy and K-means clustering methods is studied. Its nature as a gradient mapping is revealed. Its applications in clustering, Hough transform, and overfitting relaxation are examined.
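
The basic operation the abstract describes — moving each point to the mean of the data points in its neighborhood — can be sketched in one dimension with a flat kernel (a special case of the generalized process; the radius and data below are assumptions):

```python
def mean_shift_step(points, radius):
    """One mean-shift step with a flat kernel: each point moves to the
    mean of the data points within `radius` of it (1-D for brevity)."""
    new_points = []
    for p in points:
        neighbors = [q for q in points if abs(q - p) <= radius]
        new_points.append(sum(neighbors) / len(neighbors))
    return new_points

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
for _ in range(20):                  # iterate until the points stop moving
    pts = mean_shift_step(pts, radius=1.0)
# the points collapse onto two cluster centers, near 0.1 and 5.05
```

Iterating the step to a fixed point yields the clustering behavior studied in the paper: well-separated groups contract to their means.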

Predicting Stock Returns with Genetic Programming: Do the Short-Term Nonlinear Regularities Exist?
https://proceedings.mlr.press/r0/chen95a.html
This paper is devoted to applying the genetic programming paradigm to the test of the capital market efficiency hypothesis. How this paradigm is distinguished from the existing statistical approaches is briefly reviewed. Instead of using the large-sample analysis prevailing in the literature, this research rests on a small-sample analysis to inquire into the existence of short-term nonlinear regularities. By Rissanen’s MDLP (Minimum Description Length Principle), the sample period with the highest complexity is chosen. Since our simulation results, which are based on Koza’s genetic programming paradigm (KGP) and its Bayesian modification (BGP), show that it is not easy to outperform $AR(1)$ and extremely difficult to beat a random walk, the nonlinear regularities, while they might exist, are very difficult to find. Therefore, the capital market efficiency hypothesis can, at least from this perspective, be sustained.

Tailoring Rulesets to Misclassification Costs
https://proceedings.mlr.press/r0/catlett95a.html
This paper studies the capabilities obtained by modifying Quinlan’s [9] C4.5 programs for inducing decision trees and rules to permit the specification of unequal misclassification costs for binary classification tasks. Setting this cost value allows important parameters such as the percentage classified as a given class to be moved over their complete range. In some applications such parameters require precise control, but a considerable degree of variation appears difficult to suppress, particularly with rules: it is present even in the unmodified versions that treat all errors as equal. Cross-validation over a range of cost values seems the appropriate way to tune such parameters. Independent of misclassification costs, the ability to explore a spectrum of classifiers can considerably assist exploratory data analysis, delivering clearer rules than the standard version may provide. These conclusions are illustrated on a simulated version of the game Blackjack.
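
Independently of C4.5's internals, the effect of unequal misclassification costs on a probabilistic classifier can be illustrated with the textbook expected-cost decision threshold (a standard rule, not the paper's specific modification; the cost values are assumptions):

```python
def cost_threshold(c_fp, c_fn):
    """Probability threshold that minimizes expected misclassification cost:
    predict positive when P(positive) >= c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

def predict(p_pos, c_fp, c_fn):
    """Cost-sensitive decision for one case given its estimated P(positive)."""
    return "positive" if p_pos >= cost_threshold(c_fp, c_fn) else "negative"

# If a false negative costs five times a false positive, the threshold
# drops from 0.5 to 1/6, so more cases are classified positive:
label = predict(0.3, c_fp=1.0, c_fn=5.0)   # "positive"
```

Sweeping the cost ratio moves the percentage classified as positive over its full range, which is the tuning behavior the abstract describes.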

Comparing the Prediction Accuracy of Statistical Models and Artificial Neural Networks in Breast Cancer
https://proceedings.mlr.press/r0/burke95a.html
Predicting survival is important in cancer because it determines patient therapy, it matches patients for clinical trials, and it provides information to the patient. For over thirty years measuring cancer outcome has been based on the TNM Stage model. There are two problems with this model: (1) it is not very accurate (44% accurate for breast cancer), and (2) its accuracy cannot be improved because predictive variables cannot be added to the model without increasing the model’s complexity to the point where it is no longer useful to the clinician. There are several statistical models that have the potential to replace the existing TNM Stage model. All of these models can integrate new prognostic factors to increase measurement accuracy. But they are not all equally accurate, and they do not all equally meet the criteria for a new prognostic system set by the American Joint Committee on Cancer (Burke HB, Henson DE. Criteria for prognostic factors and for an enhanced prognostic system. Cancer 1993; 72:3131-5). We compare the most powerful statistical models in terms of their accuracy in predicting five-year breast cancer-specific survival. These models include principal component analysis, classification and regression trees (both pruned and shrunk), stepwise logistic regression, and five types of artificial neural networks.

Software for Data Analysis with Graphical Models
https://proceedings.mlr.press/r0/buntine95a.html
Probabilistic graphical models are being used widely in artificial intelligence and statistics, for instance, in diagnosis and expert systems, as a framework for representing and reasoning with probabilities and independencies. They come with corresponding algorithms for performing statistical inference. This offers a unifying framework for prototyping and/or generating data analysis algorithms from graphical specifications. This paper illustrates the framework with an example and then presents some basic techniques for the task: problem decomposition and the calculation of exact Bayes factors. Other tools already developed, such as automatic differentiation, Gibbs sampling, and use of the EM algorithm, make this a broad basis for the generation of data analysis software.

Discovering Morphemic Suffixes: A Case Study in MDL Induction
https://proceedings.mlr.press/r0/brent95a.html
This paper reports experiments in the automatic discovery of linguistically significant regularities in text. The minimum description length principle is exploited to evaluate linguistic hypotheses with respect to a corpus and a theory of the types of regularities to be found in it. The domain of inquiry in this paper is the discovery of morphemic suffixes such as English -ing and -ly, but the technique is widely applicable to language learning problems.

Decision-Theoretic Case-Based Reasoning
https://proceedings.mlr.press/r0/breese95a.html
We describe a decision-theoretic methodology for case-based reasoning in diagnosis and troubleshooting applications. The system utilizes a special-structure belief network to represent diagnostic cases, with nodes representing issues, causes, and symptoms. Dirichlet distributions are assessed at knowledge acquisition time to indicate the strength of relationships between variables. During a diagnosis session, a relevant subnetwork is extracted from a belief-network database that describes a very large number of diagnostic interactions and cases. The constructed network is used to make recommendations regarding possible repairs and additional observations, based on an estimate of expected repair costs. As cases are resolved, observations of issues, causes, symptoms, and the success of repairs are recorded. New variables are added to the database, and the probabilities associated with variables already in the database are updated. In this way, the inferential behavior of the system adjusts to the characteristics of the target population of users. We show how these elements work together in a cycle of troubleshooting tasks, and describe some results from a pilot system implementation and deployment.

Ploxoma: Testbed for Uncertain Inference
https://proceedings.mlr.press/r0/blau95a.html
This paper compares two formalisms for uncertain inference, Kyburg’s Combinatorial Semantics and Dempster-Shafer belief function theory, on the basis of an example from the domain of medical diagnosis. I review Shafer’s example about the imaginary disease ploxoma and show how it would be represented in Combinatorial Semantics. I conclude that belief function theory has a qualitative advantage because it offers greater flexibility of expression, and provides results about more specific classes of patients. Nevertheless, a quantitative comparison reveals that the inferences sanctioned by Combinatorial Semantics are more reliable than those of belief function theory.

Picking the Best Expert from a Sequence
https://proceedings.mlr.press/r0/bergman95a.html
We examine the problem of finding a good expert from a sequence of experts. Each expert has an "error rate"; we wish to find an expert with a low error rate. However, each expert’s error rate is unknown and can only be estimated by a sequence of experimental trials. Moreover, the distribution of error rates is also unknown. Given a bound on the total number of trials, there is thus a tradeoff between the number of experts examined and the accuracy of estimating their error rates. We present a new expert-finding algorithm and prove an upper bound on the expected error rate of the expert found. A second approach, based on the sequential ratio test, gives another expert-finding algorithm that is not provably better but which performs better in our empirical studies.
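
A naive baseline for the problem statement above — spend an equal share of the trial budget on each expert and keep the one with the fewest observed errors — can be sketched as follows (this is not the paper's algorithm, which trades off breadth against estimation accuracy; the error rates and budget are assumptions):

```python
import random

def pick_expert(error_rates, per_expert, rng):
    """Test every expert `per_expert` times with Bernoulli trials and return
    the index of the expert with the fewest observed errors."""
    best_idx, best_errs = 0, per_expert + 1
    for i, rate in enumerate(error_rates):
        errs = sum(rng.random() < rate for _ in range(per_expert))
        if errs < best_errs:
            best_idx, best_errs = i, errs
    return best_idx

# Simulated experts whose true error rates are hidden from the algorithm:
best = pick_expert([0.4, 0.05, 0.3], per_expert=100, rng=random.Random(7))
```

The weakness this baseline exposes is exactly the paper's tradeoff: with a fixed total budget, testing every expert equally limits how many experts can be examined at all.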

Tools for Empirically Analyzing AI Programs
https://proceedings.mlr.press/r0/anderson95a.html
The paper describes two separate but synergistic tools for running experiments on large Lisp systems such as Artificial Intelligence planning systems, by which we mean systems that produce plans and execute them in some kind of simulator. The first tool, called CLIP (Common Lisp Instrumentation Package), allows the researcher to define and run experiments, including experimental conditions (parameter values of the planner or simulator) and data to be collected. The data are written out to data files that can be analyzed by statistics software. The second tool, called CLASP (Common Lisp Analytical Statistics Package), allows the researcher to analyze data from experiments by using graphics, statistical tests, and various kinds of data manipulation. CLASP has a graphical user interface (using CLIM, the Common Lisp Interface Manager, Version 2.0) and also allows data to be directly processed by Lisp functions. CLIP and CLASP form the foundation of a larger set of specialized tools we are building for the empirical analysis of AI programs.

Control Representation in an EDA Assistant
https://proceedings.mlr.press/r0/amant95a.html
While research in statistics and artificial intelligence has addressed issues in the automation of later stages of analysis, such as theory generation, model selection, and experiment design [23], less attention has been given to initial exploration of data. We have developed a novel approach to exploration as search. This paper gives an overview of the design of AIDE, the Assistant for Intelligent Data Exploration, which assists humans in the early stages of data analysis [1]. The system adopts a script-based planning approach to automating EDA. Data-directed mechanisms extract simple observations and suggestive indications from the data. Scripted EDA operations are then applied in goal-directed fashion to generate deeper descriptions of the data. Control rules guide the EDA operations, relying on intermediate results for their decisions. The system is mixed-initiative, capable of autonomously pursuing high- and low-level goals while still allowing the user to guide or override its decisions. AIDE is currently a prototype under development. We emphasize that the work presented here is incomplete.

Missing Data Models as Meta-Data
https://proceedings.mlr.press/r0/almond95b.html

Hypergraph Grammars for Knowledge Based Model Construction
https://proceedings.mlr.press/r0/almond95a.html
Graphical belief networks, including Bayes nets and influence diagrams, can be represented with directed hypergraphs. Each directed hyperedge corresponds to a factor of the joint distribution of all variables in the model. A hyperedge replacement grammar is a collection of rules for replacing hyperedges with hypergraphs. A hyperedge replacement grammar for graphical belief networks defines a collection of graphical belief models. Hyperedge replacement grammars have several interesting implications in the construction of graphical models. (1) They provide a way to represent the process of constructing a graphical model. (2) Coupled with an object-oriented variable type system, they provide a convenient method for searching through candidate factors to fill a particular slot in the model graph. (3) They provide a method for integrating high-level and detailed views of a graphical model. (4) They provide a mechanism for representing uncertainty about the model structure.

Learning Multiple Relational Rule-based Models
https://proceedings.mlr.press/r0/ali95a.html
We present a method for learning multiple relational models for each class in the data. Bayesian probability theory offers an optimal strategy for combining classifications of the individual concept descriptions. Here we use a tractable approximation to that theory. Previous work in learning multiple models has been in the attribute-value realm. We show that stochastically learning multiple relational (first-order) models consisting of a ruleset for each class also yields gains in accuracy when compared to the accuracy of a single deterministically learned relational model. In addition we show that learning multiple models is most helpful when the hypothesis space is "flat" with respect to the gain metric used in learning.

A Comparative Evaluation of Sequential Feature Selection Algorithms
https://proceedings.mlr.press/r0/aha95a.html
Several recent machine learning publications demonstrate the utility of using feature selection algorithms in supervised learning tasks. Among these, sequential feature selection algorithms are receiving attention. The most frequently studied variants of these algorithms are forward and backward sequential selection. Many studies on supervised learning with sequential feature selection report applications of these algorithms, but do not consider variants of them that might be more appropriate for some performance tasks. This paper reports positive empirical results on such variants, and argues for their serious consideration in similar learning tasks.