Return Capping: Sample Efficient CVaR Policy Gradient Optimisation

Harry Mead, Clarissa Costen, Bruno Lacerda, Nick Hawes
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:43503-43518, 2025.

Abstract

When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
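The abstract describes the idea only at a high level. As a rough, hypothetical sketch (not the authors' implementation; the function names, the VaR baseline term, and the per-batch quantile estimate are assumptions for illustration), the contrast between a discard-based CVaR policy-gradient estimator and a return-capped one might look like this in PyTorch; see the paper and the linked repository for the actual formulation.

```python
import torch

def cvar_pg_loss_discard(log_probs, returns, alpha):
    """Discard-based CVaR-PG sketch: only the worst alpha-fraction of
    trajectories contributes gradient signal; the rest are thrown away,
    which is the sample-inefficiency the paper targets.
    log_probs: (N,) sum of log pi(a_t|s_t) over each trajectory
    returns:   (N,) total return of each trajectory
    """
    var = torch.quantile(returns, alpha)              # empirical VaR_alpha of the batch
    tail = returns <= var                             # worst alpha-fraction of trajectories
    # REINFORCE-style surrogate restricted to the tail, baselined by the VaR estimate
    return -(log_probs[tail] * (returns[tail] - var)).mean()

def cvar_pg_loss_capped(log_probs, returns, cap):
    """Return-capping sketch based on the abstract: clip each trajectory's
    return at the cap so that every trajectory still contributes a gradient,
    instead of discarding those above the threshold.
    """
    capped = torch.clamp(returns, max=cap)
    return -(log_probs * capped).mean()

# Toy usage with a batch of N sampled trajectories (hypothetical shapes).
N = 64
log_probs = torch.randn(N, requires_grad=True)        # stand-in for per-trajectory log-prob sums
returns = torch.randn(N) * 10.0
cap = torch.quantile(returns, 0.1).item()             # illustrative choice of cap
loss = cvar_pg_loss_capped(log_probs, returns, cap)
loss.backward()
```

Here the cap is set to the empirical 0.1-quantile of the batch returns purely for illustration; the paper's claim is that the capped reformulation is equivalent to the original CVaR objective when the cap is chosen appropriately.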

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mead25a,
  title     = {Return Capping: Sample Efficient {CV}a{R} Policy Gradient Optimisation},
  author    = {Mead, Harry and Costen, Clarissa and Lacerda, Bruno and Hawes, Nick},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {43503--43518},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mead25a/mead25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mead25a.html},
  abstract  = {When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.}
}
Endnote
%0 Conference Paper
%T Return Capping: Sample Efficient CVaR Policy Gradient Optimisation
%A Harry Mead
%A Clarissa Costen
%A Bruno Lacerda
%A Nick Hawes
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mead25a
%I PMLR
%P 43503--43518
%U https://proceedings.mlr.press/v267/mead25a.html
%V 267
%X When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
APA
Mead, H., Costen, C., Lacerda, B. & Hawes, N. (2025). Return Capping: Sample Efficient CVaR Policy Gradient Optimisation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:43503-43518. Available from https://proceedings.mlr.press/v267/mead25a.html.
