The Curse of Passive Data Collection in Batch Reinforcement Learning

Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:8413-8438, 2022.

Abstract

In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively. While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of the current paper is the characterization of this price. For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $S$ states and $A$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(A^{\min(S-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where $H$ is the length of episodes. This shows that the sample complexity blows up exponentially compared to the case of active data collection, a result which is not unexpected but, as far as we know, has not been published before, and the form of the exact expression is perhaps a little surprising. We also extend these results in various directions, such as to other criteria or to learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard.
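
To give a concrete sense of the scale of this lower bound, the following small Python sketch (illustrative only, not code from the paper) evaluates the passive-sampling requirement A^{min(S-1, H)}/eps^2 for a few problem sizes and contrasts it with a generic polynomial rate of the kind achievable with active data collection; the particular polynomial H^3 * S * A / eps^2 used for the comparison is an assumption made here for illustration, not a bound stated in the paper.

# Illustrative sketch: order-of-magnitude comparison of the passive-data lower
# bound A^{min(S-1, H)} / eps^2 (from the abstract) against a generic
# poly(S, A, H) / eps^2 rate representative of active data collection.
# The polynomial H^3 * S * A is assumed here for illustration only.

def passive_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of the passive-collection lower bound: A^{min(S-1, H)} / eps^2."""
    return A ** min(S - 1, H) / eps ** 2

def active_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Illustrative polynomial rate H^3 * S * A / eps^2 (assumed, for contrast)."""
    return H ** 3 * S * A / eps ** 2

if __name__ == "__main__":
    eps = 0.1
    for S, A, H in [(5, 2, 10), (10, 5, 20), (20, 10, 30)]:
        print(f"S={S:2d} A={A:2d} H={H:2d}  "
              f"passive ~ {passive_episodes(S, A, H, eps):.2e}  "
              f"active ~ {active_episodes(S, A, H, eps):.2e}")

Even at these modest sizes the passive requirement grows exponentially in min(S-1, H), which is exactly the blow-up the paper characterizes.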

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-xiao22b,
  title     = {The Curse of Passive Data Collection in Batch Reinforcement Learning},
  author    = {Xiao, Chenjun and Lee, Ilbin and Dai, Bo and Schuurmans, Dale and Szepesvari, Csaba},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {8413--8438},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/xiao22b/xiao22b.pdf},
  url       = {https://proceedings.mlr.press/v151/xiao22b.html},
  abstract  = {In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively. While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of the current paper is the characterization of this price. For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $S$ states and $A$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(A^{\min(S-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where $H$ is the length of episodes. This shows that the sample complexity blows up exponentially compared to the case of active data collection, a result which is not unexpected but, as far as we know, has not been published before, and the form of the exact expression is perhaps a little surprising. We also extend these results in various directions, such as to other criteria or to learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard.}
}
Endnote
%0 Conference Paper
%T The Curse of Passive Data Collection in Batch Reinforcement Learning
%A Chenjun Xiao
%A Ilbin Lee
%A Bo Dai
%A Dale Schuurmans
%A Csaba Szepesvari
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-xiao22b
%I PMLR
%P 8413--8438
%U https://proceedings.mlr.press/v151/xiao22b.html
%V 151
%X In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively. While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of the current paper is the characterization of this price. For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $S$ states and $A$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(A^{\min(S-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where $H$ is the length of episodes. This shows that the sample complexity blows up exponentially compared to the case of active data collection, a result which is not unexpected but, as far as we know, has not been published before, and the form of the exact expression is perhaps a little surprising. We also extend these results in various directions, such as to other criteria or to learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard.
APA
Xiao, C., Lee, I., Dai, B., Schuurmans, D. & Szepesvari, C. (2022). The Curse of Passive Data Collection in Batch Reinforcement Learning. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:8413-8438. Available from https://proceedings.mlr.press/v151/xiao22b.html.
