Two Applications of Statistical Modelling to Natural Language Processing

William Du Mouchel, Carol Friedman, George Hripcsak, Stephen B. Johnson, Paul D. Clayton
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, PMLR R0:192-198, 1995.

Abstract

Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. It is desired to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of this project requiring advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments to compare the performance of the computer diagnostic program to that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined for which estimated variances and covariances are easily computed. This allows statistical conclusions about the similarities and dissimilarities among diagnoses by the various programs and experts.

Cite this Paper


BibTeX
@InProceedings{pmlr-vR0-mouchel95a,
  title = {Two Applications of Statistical Modelling to Natural Language Processing},
  author = {Mouchel, William Du and Friedman, Carol and Hripcsak, George and Johnson, Stephen B. and Clayton, Paul D.},
  booktitle = {Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics},
  pages = {192--198},
  year = {1995},
  editor = {Fisher, Doug and Lenz, Hans-Joachim},
  volume = {R0},
  series = {Proceedings of Machine Learning Research},
  month = {04--07 Jan},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/r0/mouchel95a/mouchel95a.pdf},
  url = {https://proceedings.mlr.press/r0/mouchel95a.html},
  abstract = {Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. It is desired to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of this project requiring advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments to compare the performance of the computer diagnostic program to that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined for which estimated variances and covariances are easily computed. This allows statistical conclusions about the similarities and dissimilarities among diagnoses by the various programs and experts.},
  note = {Reissued by PMLR on 01 May 2022.}
}
Endnote
%0 Conference Paper
%T Two Applications of Statistical Modelling to Natural Language Processing
%A William Du Mouchel
%A Carol Friedman
%A George Hripcsak
%A Stephen B. Johnson
%A Paul D. Clayton
%B Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 1995
%E Doug Fisher
%E Hans-Joachim Lenz
%F pmlr-vR0-mouchel95a
%I PMLR
%P 192--198
%U https://proceedings.mlr.press/r0/mouchel95a.html
%V R0
%X Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. It is desired to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of this project requiring advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments to compare the performance of the computer diagnostic program to that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined for which estimated variances and covariances are easily computed. This allows statistical conclusions about the similarities and dissimilarities among diagnoses by the various programs and experts.
%Z Reissued by PMLR on 01 May 2022.
APA
Mouchel, W.D., Friedman, C., Hripcsak, G., Johnson, S.B. & Clayton, P.D. (1995). Two Applications of Statistical Modelling to Natural Language Processing. Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R0:192-198. Available from https://proceedings.mlr.press/r0/mouchel95a.html. Reissued by PMLR on 01 May 2022.
