Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback

Simone Drago, Marco Mussi, Alberto Maria Metelli
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:14499-14514, 2025.

Abstract

The success of sequential decision-making approaches, such as reinforcement learning (RL), is closely tied to the availability of reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities, which are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and discuss the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback, with promising potential applications in RL from human feedback.
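
To make the notion of preference-utility compatibility more concrete, the short sketch below is an illustration rather than the paper's formal construction: it assumes the natural reading in which trajectory a is strictly preferred to trajectory b exactly when the multi-objective utility of a Pareto-dominates that of b. The function names (pareto_dominates, is_compatible) and the toy trajectories are hypothetical.

# Illustrative sketch (not the paper's formal definition): checks whether a
# d-dimensional utility over trajectories is "compatible" with a strict
# partial preference order, reading compatibility as: a is preferred to b
# iff u(a) Pareto-dominates u(b).

from itertools import combinations


def pareto_dominates(x, y):
    """x is at least as good as y in every component and strictly better in at least one."""
    return all(xi >= yi for xi, yi in zip(x, y)) and any(xi > yi for xi, yi in zip(x, y))


def is_compatible(preferences, utility):
    """preferences: set of (a, b) pairs meaning trajectory a is strictly preferred to b.
    utility: dict mapping each trajectory to a tuple of d real-valued objectives.
    Returns True if preferred pairs (and only those) correspond to Pareto dominance."""
    for a, b in combinations(utility, 2):
        dom_ab = pareto_dominates(utility[a], utility[b])
        dom_ba = pareto_dominates(utility[b], utility[a])
        if (a, b) in preferences and not dom_ab:
            return False
        if (b, a) in preferences and not dom_ba:
            return False
        # Incomparable trajectories should not dominate each other.
        if (a, b) not in preferences and (b, a) not in preferences and (dom_ab or dom_ba):
            return False
    return True


# Three trajectories: t1 preferred to t2, while t3 is incomparable to both.
prefs = {("t1", "t2")}
u = {"t1": (2.0, 1.0), "t2": (1.0, 0.5), "t3": (0.0, 3.0)}
print(is_compatible(prefs, u))  # True: a 2-dimensional utility captures this partial order

Under this reading, the toy partial order with incomparabilities cannot be captured by any 1-dimensional (scalar) utility but is captured by a 2-dimensional one; finding the minimum such dimension is the problem the paper shows to be NP-hard.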

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-drago25b,
  title     = {Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback},
  author    = {Drago, Simone and Mussi, Marco and Metelli, Alberto Maria},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {14499--14514},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/drago25b/drago25b.pdf},
  url       = {https://proceedings.mlr.press/v267/drago25b.html}
}
Endnote
%0 Conference Paper
%T Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback
%A Simone Drago
%A Marco Mussi
%A Alberto Maria Metelli
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-drago25b
%I PMLR
%P 14499--14514
%U https://proceedings.mlr.press/v267/drago25b.html
%V 267
APA
Drago, S., Mussi, M. & Metelli, A.M. (2025). Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:14499-14514. Available from https://proceedings.mlr.press/v267/drago25b.html.
