The Limits of Predicting Agents from Behaviour

Alexis Bellot, Jonathan Richens, Tom Everitt
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:3623-3658, 2025.

Abstract

As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are infeasible. How well can we infer an agent's beliefs from its behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas, including fairness and safety.

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-bellot25a,
  title     = {The Limits of Predicting Agents from Behaviour},
  author    = {Bellot, Alexis and Richens, Jonathan and Everitt, Tom},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {3623--3658},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bellot25a/bellot25a.pdf},
  url       = {https://proceedings.mlr.press/v267/bellot25a.html}
}
EndNote
%0 Conference Paper
%T The Limits of Predicting Agents from Behaviour
%A Alexis Bellot
%A Jonathan Richens
%A Tom Everitt
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bellot25a
%I PMLR
%P 3623--3658
%U https://proceedings.mlr.press/v267/bellot25a.html
%V 267
APA
Bellot, A., Richens, J. & Everitt, T. (2025). The Limits of Predicting Agents from Behaviour. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:3623-3658. Available from https://proceedings.mlr.press/v267/bellot25a.html.
