Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:450-463, 2026.
Abstract
The escalating computational demands of Large Language Models (LLMs) raise significant concerns about their environmental sustainability. While prior work has quantified training emissions, inference, which dominates a model’s lifecycle carbon footprint, remains underexplored in holistic evaluations that jointly consider efficiency and effectiveness. This study investigates whether smaller models augmented with Retrieval-Augmented Generation (RAG) can achieve Pareto-optimal configurations that balance accuracy and carbon emissions better than larger, non-RAG models. We conduct experiments across three model families (DeepSeek-r1, Qwen3, Gemma 3) on two question answering datasets (HotpotQA, Natural Questions), measuring end-to-end emissions using CodeCarbon. Our results show that on Natural Questions, RAG enables models as small as 0.6B parameters to outperform 12B-32B models on F1 score while producing lower carbon emissions, in some cases reducing emissions by up to 90%. On HotpotQA, however, the efficiency benefits are more nuanced: RAG consistently improves F1 but does not always reduce emissions. Our work provides a systematic analysis of the efficiency-effectiveness trade-off of incorporating RAG, offering practical guidance for environmentally sustainable AI.
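As an illustration of the Pareto-optimality criterion mentioned in the abstract, the following minimal sketch selects the configurations that no other configuration beats on both F1 and emissions. It is not the authors' code: the `Config` type, the helper names, and all numeric values are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    """One model configuration (hypothetical structure for illustration)."""
    name: str
    f1: float           # effectiveness, higher is better
    emissions_g: float  # grams CO2-eq per run, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.emissions_g <= b.emissions_g
    strictly_better = a.f1 > b.f1 or a.emissions_g < b.emissions_g
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Return the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder values only; these are NOT measurements from the paper.
configs = [
    Config("small+RAG", f1=0.50, emissions_g=1.0),
    Config("small", f1=0.20, emissions_g=0.8),
    Config("large", f1=0.45, emissions_g=9.0),
    Config("large+RAG", f1=0.60, emissions_g=12.0),
]

front = pareto_front(configs)
front_names = [c.name for c in front]
print(front_names)
```

With these placeholder numbers, the hypothetical "large" configuration drops off the front because "small+RAG" has both a higher F1 and lower emissions, which is the kind of outcome the abstract reports for Natural Questions.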