Universality of Bayesian mixture predictors
Proceedings of the 28th International Conference on Algorithmic Learning Theory, PMLR 76:57-71, 2017.
Abstract
The problem is that of sequential probability forecasting for discrete-valued time series. The data is generated by an unknown probability distribution over the space of all one-way infinite sequences. It is known that this measure belongs to a given set $\mathcal{C}$, but the latter is completely arbitrary (uncountably infinite, without any structure given). The performance is measured by asymptotic average log loss. In this work it is shown that the minimax asymptotic performance is always attainable, and that it is attained by a Bayesian mixture over countably many measures from the set $\mathcal{C}$. This was previously known only for the case when the best achievable asymptotic error is 0. The new result can be interpreted as a complete-class theorem for prediction. It also contrasts with previous results showing that in the non-realizable case all Bayesian mixtures may be suboptimal. This leads to a very general conclusion concerning model selection for problems of sequential inference: it is better to take a model large enough to make sure it includes the process that generates the data, even if this entails a positive asymptotic average loss, for otherwise any combination of predictors in the model class may be useless.
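
For concreteness, one standard way to formalize the quantities mentioned above is sketched here; the notation ($\rho$, $\bar L$, $V_{\mathcal{C}}$, $w_k$) and the exact form of the loss are illustrative assumptions, not taken from the abstract itself. The asymptotic average log loss of a predictor $\rho$ on data generated by a measure $\mu$ can be written as
\[
  \bar L(\mu,\rho) \;=\; \limsup_{n\to\infty} \frac{1}{n}\, \mathbb{E}_{\mu}\!\left[-\log \rho(X_1,\dots,X_n)\right],
\]
the minimax asymptotic performance over the set $\mathcal{C}$ as
\[
  V_{\mathcal{C}} \;=\; \inf_{\rho}\, \sup_{\mu\in\mathcal{C}} \bar L(\mu,\rho),
\]
and a Bayesian mixture over countably many measures $\mu_k \in \mathcal{C}$ with prior weights $w_k$ as
\[
  \nu \;=\; \sum_{k\in\mathbb{N}} w_k\, \mu_k, \qquad w_k > 0,\ \ \sum_{k} w_k = 1 .
\]
In this notation, the main claim of the abstract reads: there exists such a mixture $\nu$ with $\sup_{\mu\in\mathcal{C}} \bar L(\mu,\nu) = V_{\mathcal{C}}$, whereas the previously known case is the one with $V_{\mathcal{C}} = 0$.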