RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

David Reber, Sean M Richardson, Todd Nief, Cristina Garbacea, Victor Veitch
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:51341-51368, 2025.

Abstract

Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
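The abstract's double-rewrite idea can be illustrated with a minimal sketch. The `reward` and `rewrite` functions below are hypothetical stand-ins (a real implementation would call a reward model and an LLM rewriter); the point is only the structure of the contrast: instead of comparing an original response against a single rewrite, both sides of the comparison are rewritten text, so additive artifacts introduced by the rewriting process tend to cancel.

```python
def reward(text: str) -> float:
    # Stand-in reward model: here, longer responses simply score higher.
    return float(len(text))

def rewrite(text: str, attribute_on: bool) -> str:
    # Stand-in LLM rewrite: a " [attr]" marker plays the role of the
    # high-level attribute (e.g. sentiment) being toggled on or off.
    base = text.replace(" [attr]", "")
    return base + " [attr]" if attribute_on else base

def rate_effect(responses_with_attribute: list[str]) -> float:
    # Naive contrast (original vs. single rewrite) would be biased,
    # because the rewrite also perturbs off-target features of the text.
    # The double-rewrite contrast compares rewrite-to-absent against
    # rewrite-back-to-present, so both terms carry the rewrite artifacts.
    diffs = []
    for x in responses_with_attribute:
        off = rewrite(x, attribute_on=False)        # first rewrite
        back_on = rewrite(off, attribute_on=True)   # second rewrite
        diffs.append(reward(back_on) - reward(off))
    return sum(diffs) / len(diffs)
```

With the toy stubs above, the estimate reduces to the reward difference attributable to the attribute marker alone, which is the behavior the double rewrite is meant to isolate.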

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-reber25a,
  title     = {{RATE}: Causal Explainability of Reward Models with Imperfect Counterfactuals},
  author    = {Reber, David and Richardson, Sean M and Nief, Todd and Garbacea, Cristina and Veitch, Victor},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {51341--51368},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/reber25a/reber25a.pdf},
  url       = {https://proceedings.mlr.press/v267/reber25a.html},
  abstract  = {Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.}
}
Endnote
%0 Conference Paper
%T RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals
%A David Reber
%A Sean M Richardson
%A Todd Nief
%A Cristina Garbacea
%A Victor Veitch
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-reber25a
%I PMLR
%P 51341--51368
%U https://proceedings.mlr.press/v267/reber25a.html
%V 267
%X Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
APA
Reber, D., Richardson, S.M., Nief, T., Garbacea, C. & Veitch, V. (2025). RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:51341-51368. Available from https://proceedings.mlr.press/v267/reber25a.html.