RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines

Alejandro Salinas-Medina; Xue Liu

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:709-722, 2026.

Abstract

False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (approximately 0.66) and macro-F1 (approximately 0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts.

Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI
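The leakage-control rule stated in the abstract (originals alone determine the split, paraphrases inherit their parent's assignment) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_by_parent`, the record keys `id` and `parent_id`, and the default test fraction are assumptions made purely for illustration.

```python
import random


def split_by_parent(samples, test_fraction=0.2, seed=0):
    """Split samples so each paraphrase follows its original.

    `samples` is a list of dicts with two hypothetical keys:
      "id"        - unique identifier of the sample
      "parent_id" - None for an original, otherwise the id of the
                    original sample that this paraphrase was derived from
    """
    # Only originals take part in the random assignment.
    originals = sorted(s["id"] for s in samples if s["parent_id"] is None)
    rng = random.Random(seed)
    rng.shuffle(originals)
    n_test = max(1, round(len(originals) * test_fraction))
    test_ids = set(originals[:n_test])

    train, test = [], []
    for s in samples:
        # A paraphrase inherits the partition of its parent original.
        root = s["id"] if s["parent_id"] is None else s["parent_id"]
        (test if root in test_ids else train).append(s)
    return train, test
```

Because assignment is keyed on the original's identifier, no paraphrase of a test item can appear in the training partition, which is the property the abstract calls strict leakage control.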

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-salinas-medina26a,
  title = 	 {RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines},
  author =       {Salinas-Medina, Alejandro and Liu, Xue},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {709--722},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/salinas-medina26a/salinas-medina26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/salinas-medina26a.html},
  abstract = 	 {False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (approximately 0.66) and macro-F1 (approximately 0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts. Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI}
}

Endnote

%0 Conference Paper
%T RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines
%A Alejandro Salinas-Medina
%A Xue Liu
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-salinas-medina26a
%I PMLR
%P 709--722
%U https://proceedings.mlr.press/v318/salinas-medina26a.html
%V 318
%X False negatives (missed detections of harmful content) remain the dominant risk in safety-critical moderation pipelines. We introduce RAG-Safe, a recall-first framework that integrates distribution-preserving contrastive augmentation, committee-diverse retrieval, and a recall-oriented decision policy into a unified moderation architecture. The framework is evaluated using a compact, fully auditable testbed designed to enforce strict leakage control: original samples alone determine the train–test split, and all paraphrases inherit their parent assignment. Within this controlled setting, conventional retrieval-augmented pipelines, both commercial (API embeddings + hosted LLM) and open-source (FAISS + local LLaMA-3), consistently under-detect unsafe content (FLAGGED recall 0.44). Applying RAG-Safe raises FLAGGED recall to approximately 0.56 across both stacks while preserving overall accuracy (approximately 0.66) and macro-F1 (approximately 0.65). A non-RAG classifier baseline provided in our public repository shows similar recall-first behaviour, reinforcing that these gains are not architecture-specific. Rather than comparing individual model components, we interpret the results as pipeline-level evidence that boundary-focused augmentation, retrieval diversity, and calibrated thresholds jointly shift LLM moderation into a safer operating regime. We conclude by discussing limitations, particularly domain transferability and adversarial robustness, and outline directions for scaling RAG-Safe to broader moderation contexts. Keywords: Content moderation, Recall-first classification, Distribution-preserving data augmentation, Committee-based retrieval, Retrieval-augmented large language models, Safety-critical AI

APA

Salinas-Medina, A. & Liu, X. (2026). RAG-Safe: A Recall-First Safety Framework Comparing Open-Source and Commercial LLM Moderation Pipelines. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:709-722. Available from https://proceedings.mlr.press/v318/salinas-medina26a.html.
