On the Error Probability of Model Selection for Classification
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:513-520, 1997.
We consider model selection based on information criteria for classification. The information criterion is expressed in the form of the empirical entropy plus a compensation term $(k(g)/2)d(n)$, where $k(g)$ is the number of independent parameters in a model $g$, $d(n)$ is a function of $n$, and $n$ is the number of examples. First, we derive for arbitrary $d(\cdot)$ the asymptotically exact error probabilities in model selection. Although it was known for linear/autoregression processes that $d(n) = \log \log n$ is the minimum function of $n$ such that the model selection satisfies strong consistency, whether the same holds for classification had been an open problem. We solve this problem affirmatively. Additionally, we derive for arbitrary $d(\cdot)$ the expected Kullback-Leibler divergence between a true conditional probability and the conditional probability estimated by the model selection and the Laplace estimators. The derived value is $k(g^*)/(2n)$, where $g^*$ is a true model, and the accumulated value over $n$ time instances is $(k(g^*)/2) \log n + O(1)$, which implies the optimality of a predictive coding based on the model selection.

Keywords: model selection, error probability, strong consistency, Kullback-Leibler divergence, minimum description length principle, Hannan/Quinn's procedure, unseparated/separated models, Kolmogorov's law of the iterated logarithm.
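To make the criterion concrete, the following is a minimal sketch of information-criterion model selection for discrete classification, not the paper's construction: it assumes candidate models are subsets of discrete features, counts $k(g)$ from the feature cardinalities observed in the data, and shows the MDL-style ($d(n)=\log n$) and Hannan/Quinn-style ($d(n)=\log\log n$) compensation terms discussed in the abstract. The function and variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def empirical_entropy(X, y, feature_subset):
    """Minimized negative log-likelihood (in nats) of y given the features in
    feature_subset, using maximum-likelihood (empirical frequency) estimates
    of the conditional distributions."""
    joint = Counter()
    marginal = Counter()
    for xi, yi in zip(X, y):
        key = tuple(xi[j] for j in feature_subset)
        joint[(key, yi)] += 1
        marginal[key] += 1
    return -sum(c * math.log(c / marginal[key]) for (key, _), c in joint.items())

def num_parameters(X, y, feature_subset):
    """k(g): number of independent parameters of the conditional model,
    taking each feature's cardinality from the data (an illustrative choice)."""
    cells = 1
    for j in feature_subset:
        cells *= len({xi[j] for xi in X})
    return cells * (len(set(y)) - 1)

def select_model(X, y, d):
    """Return the feature subset minimizing
    empirical entropy + (k(g)/2) * d(n)."""
    n = len(y)
    m = len(X[0])
    subsets = [s for r in range(m + 1) for s in combinations(range(m), r)]
    return min(subsets,
               key=lambda s: empirical_entropy(X, y, s)
                             + 0.5 * num_parameters(X, y, s) * d(n))

# Two compensation terms appearing in the abstract:
d_mdl = lambda n: math.log(n)             # MDL/BIC-style penalty
d_hq  = lambda n: math.log(math.log(n))   # Hannan/Quinn-style penalty; the paper shows
                                          # log log n is minimal for strong consistency
```

Under this sketch, `select_model(X, y, d_hq)` would pick the feature subset whose penalized empirical entropy is smallest; the paper's results concern how often such a procedure picks a wrong model and the resulting Kullback-Leibler risk.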