Leveraging Demonstrations to Improve Online Learning: Quality Matters

Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:12527-12545, 2023.

Abstract

We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes’ rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert’s competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.
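The core idea in the abstract, conditioning a Thompson sampling prior on expert demonstrations through Bayes' rule, is easy to make concrete. Below is a minimal illustrative sketch, not the paper's algorithm: it assumes a finite hypothesis space over arm-mean vectors and a softmax expert model whose inverse temperature stands in for the competence level; all names and modeling choices here are assumptions for illustration.

```python
# Illustrative sketch of demonstration-informed Thompson sampling on a
# Bernoulli bandit. Assumptions (not from the paper): a finite prior over
# candidate arm-mean vectors, and an expert whose demonstrations follow a
# softmax policy whose inverse temperature plays the role of "competence".
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis space: each row is one candidate vector of arm means.
candidates = np.array([
    [0.9, 0.5, 0.2],
    [0.5, 0.9, 0.2],
    [0.2, 0.5, 0.9],
])
n_hyp, n_arms = candidates.shape
true_means = candidates[0]        # the environment's true hypothesis
competence = 5.0                  # higher -> expert plays closer to optimally

def expert_policy(means):
    """Assumed softmax demonstration policy; as competence grows,
    all mass concentrates on the optimal arm."""
    logits = competence * np.asarray(means)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Offline phase: condition the uniform prior on expert demonstrations
# via Bayes' rule, using the assumed expert model as the likelihood.
log_post = np.zeros(n_hyp)
demos = rng.choice(n_arms, size=20, p=expert_policy(true_means))
for a in demos:
    log_post += np.log([expert_policy(h)[a] for h in candidates])

# Online phase: Thompson sampling from the demonstration-informed posterior.
T, regret = 500, 0.0
for t in range(T):
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    h = rng.choice(n_hyp, p=post)        # sample a hypothesis from the posterior
    a = int(np.argmax(candidates[h]))    # act greedily under the sample
    r = rng.binomial(1, true_means[a])   # observe a Bernoulli reward
    # Bayes' rule again, now with the reward likelihood.
    log_post += np.log(candidates[:, a] if r == 1 else 1.0 - candidates[:, a])
    regret += true_means.max() - true_means[a]

print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

With an accurate expert (high competence), the offline phase already concentrates the posterior near the truth, so online regret shrinks, mirroring the abstract's claim that improvement grows with the expert's competence level. The paper's practical variant replaces this exact posterior update with Bayesian bootstrapping, which this sketch does not attempt.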

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-hao23a,
  title     = {Leveraging Demonstrations to Improve Online Learning: Quality Matters},
  author    = {Hao, Botao and Jain, Rahul and Lattimore, Tor and Van Roy, Benjamin and Wen, Zheng},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {12527--12545},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/hao23a/hao23a.pdf},
  url       = {https://proceedings.mlr.press/v202/hao23a.html}
}
Endnote
%0 Conference Paper
%T Leveraging Demonstrations to Improve Online Learning: Quality Matters
%A Botao Hao
%A Rahul Jain
%A Tor Lattimore
%A Benjamin Van Roy
%A Zheng Wen
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-hao23a
%I PMLR
%P 12527--12545
%U https://proceedings.mlr.press/v202/hao23a.html
%V 202
APA
Hao, B., Jain, R., Lattimore, T., Van Roy, B. & Wen, Z. (2023). Leveraging Demonstrations to Improve Online Learning: Quality Matters. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:12527-12545. Available from https://proceedings.mlr.press/v202/hao23a.html.
