Minimax Weight and Q-Function Learning for Off-Policy Evaluation

Masatoshi Uehara, Jiawei Huang, Nan Jiang
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:9659-9668, 2020.

Abstract

We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) a new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy required in prior work (Liu et al., 2018); (2) another new estimator, MQL, obtained by swapping the roles of importance weights and value functions in MWL; MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner; (3) several additional results that offer further insights, including the sample complexities of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.
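For orientation, here is a minimal sketch of the two minimax objectives the abstract refers to, written in notation chosen here rather than taken from the paper: w denotes a candidate state-action density ratio d^π/d^μ, q a candidate Q-function for the target policy π, F and G the discriminator classes, d^μ the data distribution, and d_0 the initial-state distribution; the exact formulation should be checked against the paper itself.

\[
L_{\mathrm{MWL}}(w, f) = \mathbb{E}_{(s,a,s')\sim d^{\mu}}\!\big[\, w(s,a)\,\big(\gamma\, \mathbb{E}_{a'\sim\pi(\cdot\mid s')}[f(s',a')] - f(s,a)\big) \big] + (1-\gamma)\, \mathbb{E}_{s_0\sim d_0,\, a_0\sim\pi}\big[f(s_0,a_0)\big], \qquad \hat{w} = \arg\min_{w\in\mathcal{W}} \max_{f\in\mathcal{F}} L_{\mathrm{MWL}}(w,f)^2
\]

\[
L_{\mathrm{MQL}}(q, g) = \mathbb{E}_{(s,a,r,s')\sim d^{\mu}}\!\big[\, g(s,a)\,\big(r + \gamma\, \mathbb{E}_{a'\sim\pi(\cdot\mid s')}[q(s',a')] - q(s,a)\big) \big], \qquad \hat{q} = \arg\min_{q\in\mathcal{Q}} \max_{g\in\mathcal{G}} L_{\mathrm{MQL}}(q,g)^2
\]

Under this sketch, the value estimates would be \(\hat{R}_{\mathrm{MWL}} = \mathbb{E}_{d^{\mu}}[\hat{w}(s,a)\, r]\) and \(\hat{R}_{\mathrm{MQL}} = (1-\gamma)\, \mathbb{E}_{s_0\sim d_0,\, a_0\sim\pi}[\hat{q}(s_0,a_0)]\); the doubly robust combination mentioned in the abstract pairs \(\hat{w}\) with \(\hat{q}\).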

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-uehara20a,
  title     = {Minimax Weight and Q-Function Learning for Off-Policy Evaluation},
  author    = {Uehara, Masatoshi and Huang, Jiawei and Jiang, Nan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {9659--9668},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/uehara20a/uehara20a.pdf},
  url       = {https://proceedings.mlr.press/v119/uehara20a.html}
}
Endnote
%0 Conference Paper
%T Minimax Weight and Q-Function Learning for Off-Policy Evaluation
%A Masatoshi Uehara
%A Jiawei Huang
%A Nan Jiang
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-uehara20a
%I PMLR
%P 9659--9668
%U https://proceedings.mlr.press/v119/uehara20a.html
%V 119
APA
Uehara, M., Huang, J. & Jiang, N. (2020). Minimax Weight and Q-Function Learning for Off-Policy Evaluation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:9659-9668. Available from https://proceedings.mlr.press/v119/uehara20a.html.