Provable Interactive Learning with Hindsight Instruction Feedback

Dipendra Misra, Aldo Pacchiano, Robert E. Schapire
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35829-35850, 2024.

Abstract

We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast to typical approaches that train the system using reward or expert supervision on the response, we study learning with hindsight labeling, where a teacher provides an instruction that is most suitable for the agent’s generated response. Hindsight labeling of instructions is often easier to provide than expert supervision of the optimal response, which may require expert knowledge or be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that, in general, the regret of any algorithm must scale with the size of the agent’s response space. Next, we study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that it is a no-regret algorithm: its regret scales as $\sqrt{T}$ and depends on the intrinsic rank, but not on the size of the agent’s response space. We provide experiments showing the performance of LORIL in practice in two domains.
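To make the interaction protocol concrete, below is a minimal, hypothetical simulation of hindsight instruction feedback with finite instruction and response spaces and a rank-r teacher model. The problem sizes, variable names, and the greedy count-based learner are illustrative assumptions only; this is not the paper's LORIL algorithm, which exploits the low-rank structure precisely to avoid the dependence on the response space that a count-based learner incurs.

import numpy as np

# Hypothetical sketch of the hindsight-instruction protocol (not LORIL).
# We assume K instructions, N responses, and a teacher whose hindsight
# distribution P(instruction | response) is a column-normalized rank-r matrix.

rng = np.random.default_rng(0)
K, N, r, T = 5, 50, 2, 2000

# Ground-truth low-rank teacher model (simulation assumption).
U = rng.random((K, r))
V = rng.random((N, r))
M = U @ V.T                          # nonnegative K x N matrix of rank <= r
M /= M.sum(axis=0, keepdims=True)    # column y is P(instruction | response y)

counts = np.ones((K, N))             # Laplace-smoothed hindsight label counts

for t in range(T):
    i = rng.integers(K)              # instruction the agent is asked to follow
    # Plain count-based estimate of P(instruction | response); unlike LORIL,
    # this ignores the rank-r structure, so its sample complexity scales with
    # N, in line with the lower bound for unstructured response spaces.
    p_hat = counts / counts.sum(axis=0, keepdims=True)
    y = int(np.argmax(p_hat[i]))     # agent responds greedily for instruction i
    # Teacher labels the generated response with a hindsight instruction.
    j = rng.choice(K, p=M[:, y])
    counts[j, y] += 1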

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-misra24a,
  title     = {Provable Interactive Learning with Hindsight Instruction Feedback},
  author    = {Misra, Dipendra and Pacchiano, Aldo and Schapire, Robert E.},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {35829--35850},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/misra24a/misra24a.pdf},
  url       = {https://proceedings.mlr.press/v235/misra24a.html}
}
APA
Misra, D., Pacchiano, A., & Schapire, R. E. (2024). Provable Interactive Learning with Hindsight Instruction Feedback. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:35829-35850. Available from https://proceedings.mlr.press/v235/misra24a.html.
