Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs

Zeyu Dong, Yimin Zhu, Yansong Li, Kevin Mahon, Yu Sun
Proceedings of The 8th Conference on Robot Learning, PMLR 270:1231-1249, 2025.

Abstract

Traditional autonomous driving methods adopt modular design, decomposing tasks into sub-tasks, including perception, prediction, planning, and control. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset. Without adequate data, the end-to-end model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization property of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct activation in the real world. Other studies in closed-loop settings examine their results in simulated environments. In comparison, this paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end real-world driving models in a closed-loop setting. The LLM periodically takes raw sensor data to generate high-level driving instructions. In our architecture, LLMs can effectively guide the end-to-end model, even at a slower rate than the raw sensor data, because updates aren’t needed every time frame. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces the data collection requirements because the LLMs do not directly output actions, and we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains different types of obstacles. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.
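The two-rate idea in the abstract (a slow instruction generator guiding a fast action model) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_closed_loop`, `stub_llm`, `stub_policy`, the `llm_period` parameter, and the (steering, throttle) action format are all hypothetical names chosen for illustration, and the stubs stand in for a multimodal LLM and a trained imitation learning model. The loop is also synchronous for simplicity, whereas a real system would likely query the LLM asynchronously.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]       # stand-in for one raw sensor reading (assumption)
Action = Tuple[float, float]  # hypothetical (steering, throttle) pair


def run_closed_loop(
    frames: Sequence[Frame],
    llm_instruct: Callable[[Frame], str],
    policy: Callable[[Frame, str], Action],
    llm_period: int,
) -> List[Tuple[str, Action]]:
    """Refresh the instruction every `llm_period` frames; act on every frame."""
    if llm_period < 1:
        raise ValueError("llm_period must be >= 1")
    instruction = ""
    log: List[Tuple[str, Action]] = []
    for t, frame in enumerate(frames):
        # Slow path: only some frames trigger a new high-level instruction.
        if t % llm_period == 0:
            instruction = llm_instruct(frame)
        # Fast path: the low-level model runs on every frame and reuses
        # the most recent instruction, even if it is a few frames old.
        log.append((instruction, policy(frame, instruction)))
    return log


def stub_llm(frame: Frame) -> str:
    # Hypothetical rule standing in for a multimodal LLM call.
    return "steer left" if frame[0] > 0.5 else "go straight"


def stub_policy(frame: Frame, instruction: str) -> Action:
    # Hypothetical stand-in for an imitation learning model
    # conditioned on the instruction text.
    steering = -0.3 if instruction == "steer left" else 0.0
    return (steering, 0.2)


if __name__ == "__main__":
    frames = [[0.0], [0.9], [0.9], [0.1]]
    for step in run_closed_loop(frames, stub_llm, stub_policy, llm_period=2):
        print(step)
```

With `llm_period=2`, the obstacle reading that appears on the second frame only changes the instruction on the third frame, which shows the staleness the architecture is designed to tolerate.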

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-dong25a,
  title = {Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs},
  author = {Dong, Zeyu and Zhu, Yimin and Li, Yansong and Mahon, Kevin and Sun, Yu},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {1231--1249},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/dong25a/dong25a.pdf},
  url = {https://proceedings.mlr.press/v270/dong25a.html},
  abstract = {Traditional autonomous driving methods adopt modular design, decomposing tasks into sub-tasks, including perception, prediction, planning, and control. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset. Without adequate data, the end-to-end model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization property of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct activation in the real world. Other studies in closed-loop settings examine their results in simulated environments. In comparison, this paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end real-world driving models in a closed-loop setting. The LLM periodically takes raw sensor data to generate high-level driving instructions. In our architecture, LLMs can effectively guide the end-to-end model, even at a slower rate than the raw sensor data, because updates aren’t needed every time frame. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces the data collection requirements because the LLMs do not directly output actions, and we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains different types of obstacles. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.}
}
Endnote
%0 Conference Paper
%T Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs
%A Zeyu Dong
%A Yimin Zhu
%A Yansong Li
%A Kevin Mahon
%A Yu Sun
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-dong25a
%I PMLR
%P 1231--1249
%U https://proceedings.mlr.press/v270/dong25a.html
%V 270
%X Traditional autonomous driving methods adopt modular design, decomposing tasks into sub-tasks, including perception, prediction, planning, and control. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset. Without adequate data, the end-to-end model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization property of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct activation in the real world. Other studies in closed-loop settings examine their results in simulated environments. In comparison, this paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end real-world driving models in a closed-loop setting. The LLM periodically takes raw sensor data to generate high-level driving instructions. In our architecture, LLMs can effectively guide the end-to-end model, even at a slower rate than the raw sensor data, because updates aren’t needed every time frame. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces the data collection requirements because the LLMs do not directly output actions, and we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains different types of obstacles. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.
APA
Dong, Z., Zhu, Y., Li, Y., Mahon, K. & Sun, Y. (2025). Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:1231-1249. Available from https://proceedings.mlr.press/v270/dong25a.html.

Related Material