ReSa2: A Two-Stage Retrieval-Sampling Algorithm for Negative Sampling in Dense Retrieval

Muyang Li, Zihan Wang, Sijia Chen, Yijun Chen, Jiayu Li, Yiming Qiao, Xinyi Li, Bo Ji
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:686-701, 2025.

Abstract

Negative sampling algorithms are critical for training dense retrievers, which in turn determine retrieval performance in information systems. Among them, hard negative sampling is especially valuable, in particular the denoised negative sampling methods. By strategically selecting relevant negative samples, these methods make model training more effective. However, they are either restricted to single-stage retrieval, failing to fully explore potentially effective negatives, or require training an additional filter, which compromises sampling efficiency. To address this issue, the paper introduces a two-stage Retrieval-Sampling Algorithm (ReSa2). It integrates document vector-based retrieval to refine candidate selection progressively while preserving semantic relevance. In Stage 1, ReSa2 uses query vectors for broad retrieval, generating a candidate subset from the corpus to narrow the search space. In Stage 2, it reuses the retriever to perform positive-centric retrieval within this subset, leveraging positive sample vectors to re-rank candidates and enrich hard negatives with semantic similarity to the query. Throughout the process, probability-weighted sampling on the candidate subset further improves the result. Insight experiments on 40,000 query-sample pairs show that ReSa2 suppresses false negatives by 69.1% compared to Top-K sampling. On the MS MARCO passage ranking dataset (MS Pas), it outperforms the state of the art by 1.2% in MRR@10 and 0.5% in R@1000. Notably, an external validation on Natural Questions (an unseen domain) demonstrates that ReSa2 maintains robust performance when trained on MS MARCO, highlighting its generalization capability across diverse retrieval scenarios. Ablation experiments validate the complementary roles of the two stages. Our code and appendix are released at https://github.com/ad32q/ReSa2.
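
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function name resa2_style_sample, the dot-product scoring, the candidate sizes, and the softmax weighting with a temperature are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def resa2_style_sample(query_vec, pos_vec, doc_vecs, pos_idx,
                       n_candidates=100, n_rerank=30, n_negatives=4,
                       temperature=1.0, rng=None):
    """Illustrative two-stage negative sampler (hypothetical, not the paper's code)."""
    rng = np.random.default_rng() if rng is None else rng

    # Stage 1: query-centric retrieval narrows the corpus to a candidate subset.
    q_scores = doc_vecs @ query_vec
    order = np.argsort(-q_scores)
    # The known positive must never be returned as a negative.
    candidates = order[order != pos_idx][:n_candidates]

    # Stage 2: positive-centric retrieval re-ranks the subset by similarity
    # to the positive passage vector.
    p_scores = doc_vecs[candidates] @ pos_vec
    top = np.argsort(-p_scores)[:n_rerank]
    pool, pool_scores = candidates[top], p_scores[top]

    # Probability-weighted sampling over the pool (assumed: softmax of the
    # Stage 2 scores; the paper's exact weighting may differ).
    logits = (pool_scores - pool_scores.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    size = min(n_negatives, len(pool))
    return rng.choice(pool, size=size, replace=False, p=probs)

# Usage on a toy corpus of 200 unit-normalized random vectors.
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3] + 0.1 * rng.normal(size=16)  # a query close to its positive, doc 3
negatives = resa2_style_sample(query, docs[3], docs, pos_idx=3, rng=rng)
```

In this sketch, Stage 1 is what keeps the negatives tied to the query, while Stage 2 and the weighted draw decide which of those candidates are used. Sampling instead of always taking the top-ranked items is one plausible way to avoid repeatedly picking the same near-duplicates of the positive.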

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-li25d,
  title = {ReSa2: A Two-Stage Retrieval-Sampling Algorithm for Negative Sampling in Dense Retrieval},
  author = {Li, Muyang and Wang, Zihan and Chen, Sijia and Chen, Yijun and Li, Jiayu and Qiao, Yiming and Li, Xinyi and Ji, Bo},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages = {686--701},
  year = {2025},
  editor = {Lee, Hung-yi and Liu, Tongliang},
  volume = {304},
  series = {Proceedings of Machine Learning Research},
  month = {09--12 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/li25d/li25d.pdf},
  url = {https://proceedings.mlr.press/v304/li25d.html}
}
Endnote
%0 Conference Paper
%T ReSa2: A Two-Stage Retrieval-Sampling Algorithm for Negative Sampling in Dense Retrieval
%A Muyang Li
%A Zihan Wang
%A Sijia Chen
%A Yijun Chen
%A Jiayu Li
%A Yiming Qiao
%A Xinyi Li
%A Bo Ji
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-li25d
%I PMLR
%P 686--701
%U https://proceedings.mlr.press/v304/li25d.html
%V 304
APA
Li, M., Wang, Z., Chen, S., Chen, Y., Li, J., Qiao, Y., Li, X., & Ji, B. (2025). ReSa2: A Two-Stage Retrieval-Sampling Algorithm for Negative Sampling in Dense Retrieval. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:686-701. Available from https://proceedings.mlr.press/v304/li25d.html.
