Towards Custom AI Benchmarking for the Government of Canada
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:788-795, 2026.
Abstract
The Government of Canada (GC) has several options when selecting artificial intelligence (AI) systems to support its operations. At the same time, AI safety issues motivate it to assess these systems for various risks that could harm users seeking information about government operations. Thus, evaluation of AI systems has become an area of concern for the GC. Existing AI benchmarks are generally not adequate to inform this evaluation and selection process. To address this problem, we are building CAN-Bench, a bilingual benchmark designed for the Canadian public service context. Based on a dataset compiled from public GC documents, we automatically generate a bilingual set of high-quality questions covering government knowledge, safety, and public service values. This paper describes the methodology for benchmark construction and compares various AI models on the benchmark. Our results indicate that while the AI models we tested are good at answering general knowledge questions about government policies, they are not always aligned with public sector values such as non-partisanship, and can provide unsafe responses in some scenarios.