Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

Alessio Russo, Aldo Pacchiano
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52382-52421, 2025.

Abstract

We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence over finite or convex sets of rewards, a setting that has not been systematically studied in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme. Code repository: https://github.com/rssalessio/multi-reward-multi-policy-eval.
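As context for the $(\epsilon,\delta)$-PAC objective mentioned above, a minimal formalization of the evaluation guarantee reads as follows; the symbols $\Pi$ (the set of target policies), $\mathcal{R}$ (the finite or convex reward set), $\tau$ (the stopping time), and $\widehat{V}^{\pi}_{r}$ (the estimate of the discounted value $V^{\pi}_{r}$) are illustrative notation, not necessarily the paper's own:

$$\Pr\left(\forall \pi \in \Pi,\ \forall r \in \mathcal{R}:\ \bigl\|\widehat{V}^{\pi}_{r}(\tau) - V^{\pi}_{r}\bigr\|_{\infty} \le \epsilon\right) \ge 1 - \delta,$$

where the exploration scheme seeks to minimize the expected number of samples $\mathbb{E}[\tau]$ collected before stopping.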

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-russo25a,
  title     = {Adaptive Exploration for Multi-Reward Multi-Policy Evaluation},
  author    = {Russo, Alessio and Pacchiano, Aldo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52382--52421},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/russo25a/russo25a.pdf},
  url       = {https://proceedings.mlr.press/v267/russo25a.html},
  abstract  = {We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence over finite or convex sets of rewards, a setting that has not been systematically studied in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme. Code repository: https://github.com/rssalessio/multi-reward-multi-policy-eval.}
}
Endnote
%0 Conference Paper
%T Adaptive Exploration for Multi-Reward Multi-Policy Evaluation
%A Alessio Russo
%A Aldo Pacchiano
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-russo25a
%I PMLR
%P 52382--52421
%U https://proceedings.mlr.press/v267/russo25a.html
%V 267
%X We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence over finite or convex sets of rewards, a setting that has not been systematically studied in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme. Code repository: https://github.com/rssalessio/multi-reward-multi-policy-eval.
APA
Russo, A. & Pacchiano, A. (2025). Adaptive Exploration for Multi-Reward Multi-Policy Evaluation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52382-52421. Available from https://proceedings.mlr.press/v267/russo25a.html.
