Defending RAG Against Knowledge Poisoning Using Cross-Encoder Activation Signals

Razieh Moradi, Havva Alizadeh Noughabi, Fattane Zarrinkalam, Ali Dehghantanha
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:366-376, 2026.

Abstract

Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in externally retrieved evidence, but it also inherits security risks from the underlying corpus. In particular, an adversary can poison the knowledge source so that injected passages are retrieved and steer the model toward attacker-chosen targets. We propose Cross-Encoder Guardian RAG (CEG-RAG), a defense framework that leverages the internal activations of a cross-encoder reranker to detect and mitigate knowledge poisoning in RAG pipelines. CEG-RAG uses multi-instance learning (MIL) to jointly (i) detect whether the retrieved context is poisoned and (ii) localize suspicious chunks. Upon detection, it repairs the context by filtering and replacing high-risk chunks prior to answer generation while preserving a fixed context budget. On three open-domain QA benchmarks (MS MARCO, Natural Questions (NQ), and HotpotQA) under a poisoning attack, CEG-RAG achieves high detection and localization performance (TPR >85% and >88.4%, respectively, at very low FPR), reduces the attack success rate (ASR) by an average of 88.74%, and recovers correct answers. Compared to recent baseline defenses, CEG-RAG consistently provides stronger protection, and a reranker sensitivity study demonstrates its robustness across different reranker configurations. These results position cross-encoder reranker activations as a practical foundation for securing RAG against knowledge poisoning. The code and data are available at https://github.com/CyberScienceLab/CEG-RAG.
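
The detect, localize, and repair steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). It assumes each retrieved chunk already carries a hypothetical `risk` score in [0, 1], standing in for whatever a classifier over cross-encoder activations would produce. It uses max-pooling as the MIL aggregator, which is one common choice and not necessarily the one used in the paper. The names `Chunk`, `detect_bag`, `repair_context`, and the threshold value are all illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    text: str
    risk: float  # hypothetical per-chunk poisoning score in [0, 1]


def detect_bag(chunks: List[Chunk], threshold: float = 0.5) -> Tuple[bool, List[int]]:
    """Max-pooling MIL: the bag (retrieved context) is flagged as poisoned
    if at least one instance (chunk) scores at or above the threshold.
    Also returns the indices of the suspicious chunks (localization)."""
    flagged = [i for i, c in enumerate(chunks) if c.risk >= threshold]
    return bool(flagged), flagged


def repair_context(context: List[Chunk], reserve: List[Chunk],
                   threshold: float = 0.5) -> List[Chunk]:
    """Drop high-risk chunks, then refill from lower-ranked reserve chunks
    so the repaired context keeps the original size (fixed budget)
    whenever enough low-risk reserve chunks are available."""
    budget = len(context)
    kept = [c for c in context if c.risk < threshold]
    for candidate in reserve:
        if len(kept) >= budget:
            break
        if candidate.risk < threshold:
            kept.append(candidate)
    return kept


# Toy data: the scores are made up for illustration only.
context = [Chunk("a", 0.1), Chunk("b", 0.9), Chunk("c", 0.2)]
reserve = [Chunk("d", 0.8), Chunk("e", 0.05)]

poisoned, suspicious = detect_bag(context)
repaired = repair_context(context, reserve) if poisoned else context
print(poisoned, suspicious, [c.text for c in repaired])
```

In this toy run the second chunk is flagged and removed, the high-risk reserve chunk is skipped, and the next low-risk reserve chunk fills the gap, so the context passed to the generator keeps three chunks.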

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-moradi26a,
  title = {Defending RAG Against Knowledge Poisoning Using Cross-Encoder Activation Signals},
  author = {Moradi, Razieh and Noughabi, Havva Alizadeh and Zarrinkalam, Fattane and Dehghantanha, Ali},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {366--376},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/moradi26a/moradi26a.pdf},
  url = {https://proceedings.mlr.press/v318/moradi26a.html},
  abstract = {Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in externally retrieved evidence, but it also inherits security risks from the underlying corpus. In particular, an adversary can poison the knowledge source so that injected passages are retrieved and steer the model toward attacker-chosen targets. We propose Cross-Encoder Guardian RAG (CEG-RAG), a defense framework that leverages the internal activations of a cross-encoder reranker to detect and mitigate knowledge poisoning in RAG pipelines. CEG-RAG uses multi-instance learning (MIL) to jointly (i) detect whether the retrieved context is poisoned and (ii) localize suspicious chunks. Upon detection, it repairs the context by filtering and replacing high-risk chunks prior to answer generation while preserving a fixed context budget. Across three open-domain QA benchmarks---MS MARCO, Natural Questions (NQ), and HotpotQA---under a poisoning attack, \textsc{CEG-RAG} achieves high detection and localization performance (TPR >85\% and >88.4\%, respectively, at very low FPR), reduces the attack success rate (ASR) by an average of 88.74\%, and recovers correct answers. Compared to recent baseline defenses, CEG-RAG consistently provides stronger protection, and a reranker sensitivity study demonstrates its robustness across different reranker configurations. These results position cross-encoder reranker activations as a practical foundation for securing RAG against knowledge poisoning. The code and data are available at https://github.com/CyberScienceLab/CEG-RAG.}
}
Endnote
%0 Conference Paper
%T Defending RAG Against Knowledge Poisoning Using Cross-Encoder Activation Signals
%A Razieh Moradi
%A Havva Alizadeh Noughabi
%A Fattane Zarrinkalam
%A Ali Dehghantanha
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-moradi26a
%I PMLR
%P 366--376
%U https://proceedings.mlr.press/v318/moradi26a.html
%V 318
%X Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in externally retrieved evidence, but it also inherits security risks from the underlying corpus. In particular, an adversary can poison the knowledge source so that injected passages are retrieved and steer the model toward attacker-chosen targets. We propose Cross-Encoder Guardian RAG (CEG-RAG), a defense framework that leverages the internal activations of a cross-encoder reranker to detect and mitigate knowledge poisoning in RAG pipelines. CEG-RAG uses multi-instance learning (MIL) to jointly (i) detect whether the retrieved context is poisoned and (ii) localize suspicious chunks. Upon detection, it repairs the context by filtering and replacing high-risk chunks prior to answer generation while preserving a fixed context budget. On three open-domain QA benchmarks (MS MARCO, Natural Questions (NQ), and HotpotQA) under a poisoning attack, CEG-RAG achieves high detection and localization performance (TPR >85% and >88.4%, respectively, at very low FPR), reduces the attack success rate (ASR) by an average of 88.74%, and recovers correct answers. Compared to recent baseline defenses, CEG-RAG consistently provides stronger protection, and a reranker sensitivity study demonstrates its robustness across different reranker configurations. These results position cross-encoder reranker activations as a practical foundation for securing RAG against knowledge poisoning. The code and data are available at https://github.com/CyberScienceLab/CEG-RAG.
APA
Moradi, R., Noughabi, H. A., Zarrinkalam, F., & Dehghantanha, A. (2026). Defending RAG Against Knowledge Poisoning Using Cross-Encoder Activation Signals. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:366-376. Available from https://proceedings.mlr.press/v318/moradi26a.html.
