On Retrieval Properties of Samples of Large Collections

David Madigan, Yehuda Vardi, Ishay Weissman
Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, PMLR R4:203-208, 2003.

Abstract

We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where participants compare the empirical performance of different approaches. $P@K$, the proportion of the top $K$ documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that $P@K$ increases substantially when moving from a sample to the full collection. Hawking et al. (1999) posed the cause of this phenomenon as an open research question and proposed five possible explanatory hypotheses. In this paper we present a mathematical analysis of the phenomenon. We also introduce "contamination at $K$" ($C@K$), the number of irrelevant documents amongst the top $K$ relevant documents, and describe its properties. Our analysis shows that while $P@K$ typically will increase with collection size, the phenomenon is not universal. That is, there exist score distributions for which $P@K$ (and $C@K$) approach a constant limit as collection size increases.
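
As a rough illustration of the $P@K$ definition in the abstract, the following Python sketch ranks documents by score and reports the proportion of the top $K$ that are relevant. The function name `precision_at_k`, the toy scores, and the relevance labels are all assumptions made for illustration; none of them come from the paper, and the sketch does not reproduce the paper's mathematical analysis.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], relevant: Sequence[bool], k: int) -> float:
    """Proportion of the k highest-scoring documents that are relevant."""
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have the same length")
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of documents")
    # Rank document indices by descending score (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(relevant[i] for i in ranked[:k]) / k


# Hypothetical toy collection: eight documents with made-up scores and labels.
full_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
full_relevant = [True, True, False, True, False, True, False, False]

# A hypothetical 50% sample: every other document, starting with the first.
sample_scores = full_scores[::2]
sample_relevant = full_relevant[::2]

p_full = precision_at_k(full_scores, full_relevant, k=2)
p_sample = precision_at_k(sample_scores, sample_relevant, k=2)
print(f"P@2 on full collection: {p_full}")
print(f"P@2 on sample: {p_sample}")
```

In this toy case the sample drops one of the two highest-scoring relevant documents, so $P@2$ is lower on the sample than on the full collection. This is one hand-picked instance of the effect the abstract describes, not evidence for it.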

Cite this Paper


BibTeX
@InProceedings{pmlr-vR4-madigan03a,
  title = {On Retrieval Properties of Samples of Large Collections},
  author = {Madigan, David and Vardi, Yehuda and Weissman, Ishay},
  booktitle = {Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics},
  pages = {203--208},
  year = {2003},
  editor = {Bishop, Christopher M. and Frey, Brendan J.},
  volume = {R4},
  series = {Proceedings of Machine Learning Research},
  month = {03--06 Jan},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/r4/madigan03a/madigan03a.pdf},
  url = {https://proceedings.mlr.press/r4/madigan03a.html},
  abstract = {We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC) where the participants compare the empirical performance of different approaches. $P@K$, the proportion of the top $K$ documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that $P@K$ increases substantially when moving from a sample to the full collection. Hawking et al. (1999) posed as an open research question the cause of this phenomenon and proposed five possible explanatory hypotheses. In this paper we present a mathematical analysis of the phenomenon. We will also introduce "contamination at $K$," the number of irrelevant documents amongst the top $K$ relevant documents, and describe its properties. Our analysis shows that while $P@K$ typically will increase with collection size, the phenomenon is not universal. That is, there exist score distributions for which $P@K$ (and $C@K$) approach a constant limit as collection size increases.},
  note = {Reissued by PMLR on 01 April 2021.}
}
Endnote
%0 Conference Paper
%T On Retrieval Properties of Samples of Large Collections
%A David Madigan
%A Yehuda Vardi
%A Ishay Weissman
%B Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2003
%E Christopher M. Bishop
%E Brendan J. Frey
%F pmlr-vR4-madigan03a
%I PMLR
%P 203--208
%U https://proceedings.mlr.press/r4/madigan03a.html
%V R4
%X We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC) where the participants compare the empirical performance of different approaches. $P@K$, the proportion of the top $K$ documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that $P@K$ increases substantially when moving from a sample to the full collection. Hawking et al. (1999) posed as an open research question the cause of this phenomenon and proposed five possible explanatory hypotheses. In this paper we present a mathematical analysis of the phenomenon. We will also introduce "contamination at $K$," the number of irrelevant documents amongst the top $K$ relevant documents, and describe its properties. Our analysis shows that while $P@K$ typically will increase with collection size, the phenomenon is not universal. That is, there exist score distributions for which $P@K$ (and $C@K$) approach a constant limit as collection size increases.
%Z Reissued by PMLR on 01 April 2021.
APA
Madigan, D., Vardi, Y. & Weissman, I. (2003). On Retrieval Properties of Samples of Large Collections. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R4:203-208. Available from https://proceedings.mlr.press/r4/madigan03a.html. Reissued by PMLR on 01 April 2021.
