RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines

Alejandro Salinas-Medina, Xue Liu
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:709-722, 2026.

Abstract

False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (0.66) and macro-F1 (0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts.

Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI
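
The leakage-control rule stated in the abstract (only original samples determine the train–test split, and every paraphrase follows its parent) can be illustrated with a minimal Python sketch. This is not the authors' code: the record layout (`id`, `parent_id`, `text`, `label`), the function name `split_by_parent`, and the `test_fraction` and `seed` parameters are all assumptions made for illustration.

```python
import random


def split_by_parent(samples, test_fraction=0.2, seed=0):
    """Split samples so that paraphrases never cross the train/test boundary.

    Assumed (hypothetical) record layout: each sample is a dict with keys
    'id', 'parent_id' (None for an original sample), 'text', and 'label'.
    """
    # Only original samples take part in the random assignment.
    original_ids = sorted(s["id"] for s in samples if s["parent_id"] is None)
    rng = random.Random(seed)
    rng.shuffle(original_ids)

    n_test = max(1, round(len(original_ids) * test_fraction))
    test_ids = set(original_ids[:n_test])

    train, test = [], []
    for s in samples:
        # A paraphrase inherits the assignment of its parent original.
        root = s["id"] if s["parent_id"] is None else s["parent_id"]
        (test if root in test_ids else train).append(s)
    return train, test
```

Because the assignment is keyed on the parent identifier, a paraphrase of a test-set original can never appear in the training side, which is the property the abstract refers to as strict leakage control.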

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-salinas-medina26a,
  title = {RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines},
  author = {Salinas-Medina, Alejandro and Liu, Xue},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {709--722},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/salinas-medina26a/salinas-medina26a.pdf},
  url = {https://proceedings.mlr.press/v318/salinas-medina26a.html},
  abstract = {False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (0.66) and macro-F1 (0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts. Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI}
}
Endnote
%0 Conference Paper
%T RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines
%A Alejandro Salinas-Medina
%A Xue Liu
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-salinas-medina26a
%I PMLR
%P 709--722
%U https://proceedings.mlr.press/v318/salinas-medina26a.html
%V 318
%X False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (0.66) and macro-F1 (0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts. Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI
APA
Salinas-Medina, A. & Liu, X. (2026). RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:709-722. Available from https://proceedings.mlr.press/v318/salinas-medina26a.html.
