Vision-Language Models as Success Detectors

Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica Landon, Felix Hill, Nando de Freitas, Serkan Cabi
Proceedings of The 2nd Conference on Lifelong Learning Agents, PMLR 232:120-136, 2023.

Abstract

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) “in-the-wild” human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of “in-the-wild” human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our results encourage further work in real world success detection and reward modelling with pretrained vision-language models.
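As a concrete illustration of the SuccessVQA framing described above, the short sketch below shows how a success-detection query can be posed as a yes/no visual question over the frames of an episode. It is not code from the paper: the SuccessVQAExample structure, the VisionLanguageModel interface, the ConstantVLM stub, and the example task are illustrative placeholders standing in for a Flamingo model fine-tuned on human reward annotations.

    """Sketch of success detection posed as visual question answering (SuccessVQA).

    Assumptions: the data structure and the VLM interface below are illustrative
    placeholders, not the paper's implementation.
    """
    from dataclasses import dataclass
    from typing import Any, Optional, Protocol, Sequence

    Frame = Any  # e.g. an image array; kept abstract in this sketch


    @dataclass
    class SuccessVQAExample:
        """One success-detection query expressed as a VQA example."""
        frames: Sequence[Frame]        # frames subsampled from an episode or clip
        task: str                      # natural-language task description
        label: Optional[bool] = None   # human annotation: did the behaviour succeed?

        @property
        def question(self) -> str:
            # The task description is rephrased as a yes/no question for the VLM.
            return f"Did the agent successfully {self.task}?"


    class VisionLanguageModel(Protocol):
        """Placeholder interface for a VQA-capable vision-language model."""
        def yes_probability(self, frames: Sequence[Frame], question: str) -> float:
            ...


    def detect_success(vlm: VisionLanguageModel,
                       example: SuccessVQAExample,
                       threshold: float = 0.5) -> bool:
        """Binary success detection: ask the yes/no question and threshold the answer."""
        return vlm.yes_probability(example.frames, example.question) >= threshold


    class ConstantVLM:
        """Trivial stand-in model so the sketch runs end to end."""
        def __init__(self, p: float):
            self.p = p

        def yes_probability(self, frames: Sequence[Frame], question: str) -> float:
            return self.p


    if __name__ == "__main__":
        example = SuccessVQAExample(frames=[], task="put the apple in the bowl")
        print(example.question)                            # Did the agent successfully put the apple in the bowl?
        print(detect_success(ConstantVLM(0.9), example))   # True

In this framing, the same pretrained VLM and question template can be reused across tasks and domains; only the natural-language task description changes, which is what the paper's generalisation experiments probe.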

Cite this Paper


BibTeX
@InProceedings{pmlr-v232-du23b,
  title     = {Vision-Language Models as Success Detectors},
  author    = {Du, Yuqing and Konyushkova, Ksenia and Denil, Misha and Raju, Akhil and Landon, Jessica and Hill, Felix and de Freitas, Nando and Cabi, Serkan},
  booktitle = {Proceedings of The 2nd Conference on Lifelong Learning Agents},
  pages     = {120--136},
  year      = {2023},
  editor    = {Chandar, Sarath and Pascanu, Razvan and Sedghi, Hanie and Precup, Doina},
  volume    = {232},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--25 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v232/du23b/du23b.pdf},
  url       = {https://proceedings.mlr.press/v232/du23b.html},
  abstract  = {Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) “in-the-wild” human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of “in-the-wild” human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our results encourage further work in real world success detection and reward modelling with pretrained vision-language models.}
}
Endnote
%0 Conference Paper
%T Vision-Language Models as Success Detectors
%A Yuqing Du
%A Ksenia Konyushkova
%A Misha Denil
%A Akhil Raju
%A Jessica Landon
%A Felix Hill
%A Nando de Freitas
%A Serkan Cabi
%B Proceedings of The 2nd Conference on Lifelong Learning Agents
%C Proceedings of Machine Learning Research
%D 2023
%E Sarath Chandar
%E Razvan Pascanu
%E Hanie Sedghi
%E Doina Precup
%F pmlr-v232-du23b
%I PMLR
%P 120--136
%U https://proceedings.mlr.press/v232/du23b.html
%V 232
%X Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) “in-the-wild” human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of “in-the-wild” human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our results encourage further work in real world success detection and reward modelling with pretrained vision-language models.
APA
Du, Y., Konyushkova, K., Denil, M., Raju, A., Landon, J., Hill, F., de Freitas, N. & Cabi, S. (2023). Vision-Language Models as Success Detectors. Proceedings of The 2nd Conference on Lifelong Learning Agents, in Proceedings of Machine Learning Research 232:120-136. Available from https://proceedings.mlr.press/v232/du23b.html.