Part-of-Speech Tagging from "Small" Data Sets

Eric Neufeld, Greg Adams, Henry Choy, Ron Orthner, Tim Philip, Ahmed Tawfik
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, PMLR R0:410-416, 1995.

Abstract

Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model method on a 900,000 token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000 token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues that this may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.
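
To make the ending-based idea concrete, below is a minimal sketch of an HMM tagger with Viterbi decoding, where unseen words fall back on the tag distribution of their last two characters. The toy corpus, the tag set, the two-character suffix length, and the 0.1 discount factor are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Toy tagged corpus standing in for LOB-style training data (hypothetical).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("young", "ADJ"), ("dog", "NOUN"), ("jumps", "VERB")],
]

trans = defaultdict(lambda: defaultdict(int))   # tag -> next-tag counts
emit = defaultdict(lambda: defaultdict(int))    # tag -> word counts
suffix = defaultdict(lambda: defaultdict(int))  # 2-char ending -> tag counts
tags = set()

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        suffix[word[-2:]][tag] += 1
        tags.add(tag)
        prev = tag

def p_trans(t1, t2):
    # Add-one smoothing so unseen tag transitions keep nonzero mass.
    total = sum(trans[t1].values())
    return (trans[t1][t2] + 1) / (total + len(tags))

def p_emit(tag, word):
    total = sum(emit[tag].values())
    if word in emit[tag]:
        return emit[tag][word] / total
    # Ending-based fallback for unseen words: score the tag by the
    # tag distribution observed for the word's last two characters.
    end = suffix[word[-2:]]
    end_total = sum(end.values())
    if end_total:
        return 0.1 * end[tag] / end_total  # 0.1 is an arbitrary discount
    return 1.0 / (total + 1)               # last-resort near-uniform guess

def viterbi(words):
    # Standard Viterbi decoding; a real tagger would work in log space.
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            score, path = max(
                (best[p][0] * p_trans(p, t), best[p][1]) for p in tags
            )
            nxt[t] = (score * p_emit(t, w), path + [t])
        best = nxt
    return max(best.values())[1]

# "walks" never occurs in training, but its ending "ks" was seen on a VERB.
print(viterbi(["the", "dog", "walks"]))  # -> ['DET', 'NOUN', 'VERB']
```

The suffix fallback is what lets a tagger trained on a small vocabulary assign a sensible tag to the 3-4 per cent of test tokens it has never seen, at the cost of a heuristic scaling of the emission probability.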

Cite this Paper


BibTeX
@InProceedings{pmlr-vR0-neufeld95a,
  title = {Part-of-Speech Tagging from "Small" Data Sets},
  author = {Neufeld, Eric and Adams, Greg and Choy, Henry and Orthner, Ron and Philip, Tim and Tawfik, Ahmed},
  booktitle = {Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics},
  pages = {410--416},
  year = {1995},
  editor = {Fisher, Doug and Lenz, Hans-Joachim},
  volume = {R0},
  series = {Proceedings of Machine Learning Research},
  month = {04--07 Jan},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/r0/neufeld95a/neufeld95a.pdf},
  url = {https://proceedings.mlr.press/r0/neufeld95a.html},
  abstract = {Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model method on a 900,000 token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000 token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues that this may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.},
  note = {Reissued by PMLR on 01 May 2022.}
}
Endnote
%0 Conference Paper
%T Part-of-Speech Tagging from "Small" Data Sets
%A Eric Neufeld
%A Greg Adams
%A Henry Choy
%A Ron Orthner
%A Tim Philip
%A Ahmed Tawfik
%B Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 1995
%E Doug Fisher
%E Hans-Joachim Lenz
%F pmlr-vR0-neufeld95a
%I PMLR
%P 410--416
%U https://proceedings.mlr.press/r0/neufeld95a.html
%V R0
%X Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model method on a 900,000 token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000 token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues that this may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.
%Z Reissued by PMLR on 01 May 2022.
APA
Neufeld, E., Adams, G., Choy, H., Orthner, R., Philip, T. & Tawfik, A. (1995). Part-of-Speech Tagging from "Small" Data Sets. Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R0:410-416. Available from https://proceedings.mlr.press/r0/neufeld95a.html. Reissued by PMLR on 01 May 2022.