Reinforcement Learning from Multi-level and Episodic Human Feedback

Muhammad Qasim Elahi, Somtochukwu Oguchienti, Maheed H. Ahmed, Mahsa Ghasemi
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:1180-1193, 2025.

Abstract

Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback—expressed as a preference for one behavior over another—to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback offers coarse but more informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on an entire episode’s evaluation. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
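
To make the feedback model concrete, below is a minimal, self-contained sketch (not the paper's algorithm) of estimating a reward model from episode-level multi-level scores. It assumes a linear reward over trajectory features and an ordered-logistic rater with known cutpoints; the feature map, cutpoints, and synthetic data are illustrative assumptions only.

# Illustrative sketch: fit a linear reward model from episode-level ordinal scores
# via an ordered-logistic (proportional-odds) likelihood. All modeling choices here
# (linear reward, known cutpoints, Gaussian features) are assumptions for the example.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
d, H, n_episodes, K = 5, 20, 500, 4              # feature dim, horizon, episodes, score levels
w_true = rng.normal(size=d)                      # unknown reward weights (assumed linear reward)
cutpoints = np.array([-2.0, 0.0, 2.0])           # assumed known thresholds between score levels

# Each episode is summarized by the (scaled) sum of its per-step feature vectors.
Phi = rng.normal(size=(n_episodes, H, d)).sum(axis=1) / np.sqrt(H)
utility = Phi @ w_true                           # latent episodic return

# Simulated rater: maps the latent return to an ordinal score in {0, ..., K-1}.
cdf = expit(cutpoints[None, :] - utility[:, None])                      # P(score <= k)
probs = np.diff(np.concatenate([np.zeros((n_episodes, 1)), cdf,
                                np.ones((n_episodes, 1))], axis=1), axis=1)
scores = np.array([rng.choice(K, p=p) for p in probs])

def neg_log_lik(w):
    # Negative log-likelihood of the observed scores under the ordered-logistic model.
    u = Phi @ w
    cdf = expit(cutpoints[None, :] - u[:, None])
    p = np.diff(np.concatenate([np.zeros((len(u), 1)), cdf,
                                np.ones((len(u), 1))], axis=1), axis=1)
    return -np.log(p[np.arange(len(u)), scores] + 1e-12).sum()

w_hat = minimize(neg_log_lik, x0=np.zeros(d)).x
print("cosine similarity to true weights:",
      w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true)))

The cosine similarity printed at the end is a rough check that, in this toy setup, episode-level ordinal scores carry enough signal to recover the direction of the underlying reward weights.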

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-elahi25a,
  title     = {Reinforcement Learning from Multi-level and Episodic Human Feedback},
  author    = {Elahi, Muhammad Qasim and Oguchienti, Somtochukwu and Ahmed, Maheed H. and Ghasemi, Mahsa},
  booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
  pages     = {1180--1193},
  year      = {2025},
  editor    = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro},
  volume    = {283},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--06 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/elahi25a/elahi25a.pdf},
  url       = {https://proceedings.mlr.press/v283/elahi25a.html},
  abstract  = {Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback—expressed as a preference for one behavior over another—to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback offers more coarse but informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on an entire episode’s evaluation. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.}
}
Endnote
%0 Conference Paper
%T Reinforcement Learning from Multi-level and Episodic Human Feedback
%A Muhammad Qasim Elahi
%A Somtochukwu Oguchienti
%A Maheed H. Ahmed
%A Mahsa Ghasemi
%B Proceedings of the 7th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Necmiye Ozay
%E Laura Balzano
%E Dimitra Panagou
%E Alessandro Abate
%F pmlr-v283-elahi25a
%I PMLR
%P 1180--1193
%U https://proceedings.mlr.press/v283/elahi25a.html
%V 283
%X Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback—expressed as a preference for one behavior over another—to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback offers more coarse but informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on an entire episode’s evaluation. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
APA
Elahi, M.Q., Oguchienti, S., Ahmed, M.H. & Ghasemi, M. (2025). Reinforcement Learning from Multi-level and Episodic Human Feedback. Proceedings of the 7th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 283:1180-1193. Available from https://proceedings.mlr.press/v283/elahi25a.html.