Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG

Ethan Heavey, Paul Cook
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:450-463, 2026.

Abstract

The escalating computational demands of Large Language Models (LLMs) raise significant concerns regarding their environmental sustainability. While prior work has quantified training emissions, inference - which dominates a model’s lifecycle carbon footprint - remains underexplored in holistic evaluations that jointly consider efficiency and effectiveness. This study investigates whether smaller models augmented with Retrieval-Augmented Generation (RAG) can achieve Pareto-optimal configurations that balance accuracy and carbon emissions better than larger, non-RAG models. We conduct experiments across three model families (DeepSeek-r1, Qwen3, Gemma 3) on two question answering datasets (HotpotQA, Natural Questions), measuring end-to-end emissions using CodeCarbon. Our results show that on Natural Questions, RAG enables models as small as 0.6B parameters to outperform 12B-32B models in terms of F1 score with lower carbon emissions, in some cases achieving up to 90% emission reductions. However, on HotpotQA, the efficiency benefits are more nuanced, with RAG consistently improving F1, but not always reducing emissions. Our work provides a systematic analysis of the efficiency-effectiveness trade-off of incorporating RAG, offering practical guidance for environmentally sustainable AI.
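The notion of a Pareto-optimal configuration mentioned in the abstract can be illustrated with a minimal sketch. It assumes each configuration is summarized by an F1 score (higher is better) and an emissions figure in kg CO2eq (lower is better). The `Config` type, the configuration names, and every number below are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str            # hypothetical label, not a model from the paper
    f1: float            # effectiveness: higher is better
    emissions_kg: float  # carbon cost: lower is better


def dominates(a: Config, b: Config) -> bool:
    """Return True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.emissions_kg <= b.emissions_kg
    strictly_better = a.f1 > b.f1 or a.emissions_kg < b.emissions_kg
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up values chosen only to exercise the function.
configs = [
    Config("small", 0.30, 0.010),
    Config("small+rag", 0.55, 0.015),
    Config("large", 0.50, 0.100),
    Config("large+rag", 0.60, 0.120),
]

front = pareto_front(configs)
print([c.name for c in front])
```

In this toy input, "large" is dominated by "small+rag", which has both a higher F1 and lower emissions, so it drops off the front. The other three remain because each one trades accuracy against emissions in a way no other configuration beats on both axes.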

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-heavey26a,
  title = {Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG},
  author = {Heavey, Ethan and Cook, Paul},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {450--463},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/heavey26a/heavey26a.pdf},
  url = {https://proceedings.mlr.press/v318/heavey26a.html},
  abstract = {The escalating computational demands of Large Language Models (LLMs) raise significant concerns regarding their environmental sustainability. While prior work has quantified training emissions, inference - which dominates a model’s lifecycle carbon footprint - remains underexplored in holistic evaluations that jointly consider efficiency and effectiveness. This study investigates whether smaller models augmented with Retrieval-Augmented Generation (RAG) can achieve Pareto-optimal configurations that balance accuracy and carbon emissions better than larger, non-RAG models. We conduct experiments across three model families (DeepSeek-r1, Qwen3, Gemma 3) on two question answering datasets (HotpotQA, Natural Questions), measuring end-to-end emissions using CodeCarbon. Our results show that on Natural Questions, RAG enables models as small as 0.6B parameters to outperform 12B-32B models in terms of F1 score with lower carbon emissions, in some cases achieving up to 90% emission reductions. However, on HotpotQA, the efficiency benefits are more nuanced, with RAG consistently improving F1, but not always reducing emissions. Our work provides a systematic analysis of the efficiency-effectiveness trade-off of incorporating RAG, offering practical guidance for environmentally sustainable AI.}
}
Endnote
%0 Conference Paper
%T Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG
%A Ethan Heavey
%A Paul Cook
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-heavey26a
%I PMLR
%P 450--463
%U https://proceedings.mlr.press/v318/heavey26a.html
%V 318
%X The escalating computational demands of Large Language Models (LLMs) raise significant concerns regarding their environmental sustainability. While prior work has quantified training emissions, inference - which dominates a model’s lifecycle carbon footprint - remains underexplored in holistic evaluations that jointly consider efficiency and effectiveness. This study investigates whether smaller models augmented with Retrieval-Augmented Generation (RAG) can achieve Pareto-optimal configurations that balance accuracy and carbon emissions better than larger, non-RAG models. We conduct experiments across three model families (DeepSeek-r1, Qwen3, Gemma 3) on two question answering datasets (HotpotQA, Natural Questions), measuring end-to-end emissions using CodeCarbon. Our results show that on Natural Questions, RAG enables models as small as 0.6B parameters to outperform 12B-32B models in terms of F1 score with lower carbon emissions, in some cases achieving up to 90% emission reductions. However, on HotpotQA, the efficiency benefits are more nuanced, with RAG consistently improving F1, but not always reducing emissions. Our work provides a systematic analysis of the efficiency-effectiveness trade-off of incorporating RAG, offering practical guidance for environmentally sustainable AI.
APA
Heavey, E. & Cook, P. (2026). Smaller, Smarter, Greener: Reducing LLM Inference Emissions with RAG. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:450-463. Available from https://proceedings.mlr.press/v318/heavey26a.html.

Related Material