Part-of-Speech Tagging from "Small" Data Sets
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, PMLR R0:410-416, 1995.
Abstract
Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model method on a 900,000 token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000 token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of the tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible, let alone one that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues that such a corpus may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.
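To make the ending-based idea concrete, here is a minimal sketch (not the paper's implementation) of a bigram HMM tagger whose emission model falls back to word-ending statistics when a word was never seen in training. The suffix length, smoothing constant, tag names, and toy corpus are assumptions made purely for illustration.

```python
from collections import Counter

SUFFIX_LEN = 3   # assumed length of the word ending used for unseen words
SMOOTH = 1e-6    # assumed additive smoothing constant

def train(tagged_sentences):
    """Collect transition, emission, and word-ending counts."""
    trans, emit, suffix, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"
        tag_count["<s>"] += 1          # one pseudo-start state per sentence
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            suffix[(tag, word[-SUFFIX_LEN:])] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, suffix, tag_count, vocab

def tag_sentence(words, model):
    """Viterbi decoding under the bigram HMM."""
    trans, emit, suffix, tag_count, vocab = model
    tags = [t for t in tag_count if t != "<s>"]

    def p_trans(a, b):
        return (trans[(a, b)] + SMOOTH) / (tag_count[a] + SMOOTH * len(tags))

    def p_emit(t, w):
        if w in vocab:
            return (emit[(t, w)] + SMOOTH) / (tag_count[t] + SMOOTH * len(vocab))
        # Unseen word: score it by how often this tag produced its ending.
        return (suffix[(t, w[-SUFFIX_LEN:])] + SMOOTH) / (tag_count[t] + SMOOTH * len(vocab))

    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max((best[p][0] * p_trans(p, t), best[p][1]) for p in tags)
            new[t] = (score * p_emit(t, w), path + [t])
        best = new
    return max(best.values())[1]

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("walks", "VERB")],
]
model = train(corpus)
# "talks" never occurs in training, but its ending "lks" was seen with VERB,
# so the suffix fallback still tags it correctly.
print(tag_sentence(["the", "cat", "talks"], model))  # -> ['DET', 'NOUN', 'VERB']
```

The point of the fallback is that endings are a far smaller, far better-sampled inventory than whole words, so even a "small" training corpus gives usable estimates for the 3-4 per cent of test tokens that were never observed.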