SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

Jiawei Zhang; Xuan Yang; Taiqi Wang; Yu Yao; Aleksandr Petiushko; Bo Li

SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, Bo Li

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76497-76517, 2025.

Abstract

Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light =$>$ stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-zhang25cm,
  title = 	 {{S}afe{A}uto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models},
  author =       {Zhang, Jiawei and Yang, Xuan and Wang, Taiqi and Yao, Yu and Petiushko, Aleksandr and Li, Bo},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {76497--76517},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25cm/zhang25cm.pdf},
  url = 	 {https://proceedings.mlr.press/v267/zhang25cm.html},
  abstract = 	 {Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light =$>$ stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.}
}

Endnote

%0 Conference Paper
%T SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
%A Jiawei Zhang
%A Xuan Yang
%A Taiqi Wang
%A Yu Yao
%A Aleksandr Petiushko
%A Bo Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu	
%F pmlr-v267-zhang25cm
%I PMLR
%P 76497--76517
%U https://proceedings.mlr.press/v267/zhang25cm.html
%V 267
%X Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light =$>$ stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.

APA

Zhang, J., Yang, X., Wang, T., Yao, Y., Petiushko, A. & Li, B.. (2025). SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76497-76517 Available from https://proceedings.mlr.press/v267/zhang25cm.html.

SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

Abstract

Cite this Paper

Related Material