Towards Custom AI Benchmarking for the Government of Canada

Gabriel Bernier-Colborne; Yvan Gauthier; Sowmya Vajjala

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:788-795, 2026.

Abstract

The Government of Canada (GC) has several options when selecting artificial intelligence (AI) systems to support its operations. At the same time, AI safety issues motivate it to assess these models for various risks that could harm users seeking information about government operations. Thus, evaluation of AI systems has become an area of concern for the GC. Existing AI benchmarks do not suffice to inform the evaluation/selection process as they are generally not adequate for this. To address this problem, we are building CAN-Bench, a bilingual benchmark designed for the Canadian public service context. Based on a dataset compiled from public GC documents, we automatically generate a bilingual set of high-quality questions around government knowledge, safety, and public service values. This paper describes the methodology for benchmark construction and a comparison of various AI models on the benchmark. Our results indicate that while the AI models we tested are good at answering general knowledge questions about government policies, they are not always aligned with public sector values such as non-partisanship, and can potentially provide unsafe responses in some scenarios.

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-bernier-colborne26a,
  title = 	 {Towards Custom AI Benchmarking for the Government of Canada},
  author =       {Bernier-Colborne, Gabriel and Gauthier, Yvan and Vajjala, Sowmya},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {788--795},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/bernier-colborne26a/bernier-colborne26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/bernier-colborne26a.html},
  abstract = 	 {The Government of Canada (GC) has several options when selecting artificial intelligence (AI) systems to support its operations. At the same time, AI safety issues motivate it to assess these models for various risks that could harm users seeking information about government operations. Thus, evaluation of AI systems has become an area of concern for the GC. Existing AI benchmarks do not suffice to inform the evaluation/selection process as they are generally not adequate for this. To address this problem, we are building CAN-Bench, a bilingual benchmark designed for the Canadian public service context. Based on a dataset compiled from public GC documents, we automatically generate a bilingual set of high-quality questions around government knowledge, safety, and public service values. This paper describes the methodology for benchmark construction and a comparison of various AI models on the benchmark. Our results indicate that while the AI models we tested are good at answering general knowledge questions about government policies, they are not always aligned with public sector values such as non-partisanship, and can potentially provide unsafe responses in some scenarios.}
}

Endnote

%0 Conference Paper
%T Towards Custom AI Benchmarking for the Government of Canada
%A Gabriel Bernier-Colborne
%A Yvan Gauthier
%A Sowmya Vajjala
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-bernier-colborne26a
%I PMLR
%P 788--795
%U https://proceedings.mlr.press/v318/bernier-colborne26a.html
%V 318
%X The Government of Canada (GC) has several options when selecting artificial intelligence (AI) systems to support its operations. At the same time, AI safety issues motivate it to assess these models for various risks that could harm users seeking information about government operations. Thus, evaluation of AI systems has become an area of concern for the GC. Existing AI benchmarks do not suffice to inform the evaluation/selection process as they are generally not adequate for this. To address this problem, we are building CAN-Bench, a bilingual benchmark designed for the Canadian public service context. Based on a dataset compiled from public GC documents, we automatically generate a bilingual set of high-quality questions around government knowledge, safety, and public service values. This paper describes the methodology for benchmark construction and a comparison of various AI models on the benchmark. Our results indicate that while the AI models we tested are good at answering general knowledge questions about government policies, they are not always aligned with public sector values such as non-partisanship, and can potentially provide unsafe responses in some scenarios.

APA

Bernier-Colborne, G., Gauthier, Y. & Vajjala, S. (2026). Towards Custom AI Benchmarking for the Government of Canada. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:788-795. Available from https://proceedings.mlr.press/v318/bernier-colborne26a.html.
