On Retrieval Properties of Samples of Large Collections
Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, PMLR R4:203-208, 2003.
Abstract
We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where the participants compare the empirical performance of different approaches. $P@K$, the proportion of the top $K$ documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that $P@K$ increases substantially when moving from a sample to the full collection. Hawking et al. (1999) posed the cause of this phenomenon as an open research question and proposed five possible explanatory hypotheses. In this paper we present a mathematical analysis of the phenomenon. We also introduce "contamination at $K$" ($C@K$), the number of irrelevant documents amongst the top $K$ relevant documents, and describe its properties. Our analysis shows that while $P@K$ typically will increase with collection size, the phenomenon is not universal. That is, there exist score distributions for which $P@K$ (and $C@K$) approach a constant limit as collection size increases.
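The two measures named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names `precision_at_k` and `contamination_at_k` and the toy ranking are hypothetical. It also assumes that "the number of irrelevant documents amongst the top $K$ relevant documents" means the count of irrelevant documents ranked above the $K$-th relevant document, which is one reading of the abstract's wording.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """P@K: proportion of the top-k ranked documents that are relevant.

    `relevance` lists relevance judgments in ranked order (best score first).
    """
    if k <= 0 or k > len(relevance):
        raise ValueError("k must be between 1 and the number of ranked documents")
    return sum(relevance[:k]) / k


def contamination_at_k(relevance: Sequence[bool], k: int) -> int:
    """C@K under the assumed reading: the number of irrelevant documents
    ranked above the k-th relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_seen = 0
    irrelevant_seen = 0
    for is_relevant in relevance:
        if is_relevant:
            relevant_seen += 1
            if relevant_seen == k:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    raise ValueError("the ranking contains fewer than k relevant documents")


# Hypothetical ranked list of judgments, highest-scoring document first.
ranking = [True, False, True, True, False, False, True]

print(precision_at_k(ranking, 5))      # 3 relevant in the top 5
print(contamination_at_k(ranking, 3))  # irrelevant documents above the 3rd relevant one
```

Under this reading, $P@K$ fixes the number of documents inspected and counts the relevant ones, whereas $C@K$ fixes the number of relevant documents retrieved and counts the irrelevant ones encountered along the way.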