Learning to Navigate Using Mid-Level Visual Priors

Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik
Proceedings of the Conference on Robot Learning, PMLR 100:791-812, 2020.

Abstract

How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1). This skill set (“mid-level vision”) provides the policy with a more processed state of the world compared to raw images. Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance, when compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we found that no single visual representation was universally useful for all downstream tasks, hence we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v100-sax20a,
  title     = {Learning to Navigate Using Mid-Level Visual Priors},
  author    = {Sax, Alexander and Zhang, Jeffrey O. and Emi, Bradley and Zamir, Amir and Savarese, Silvio and Guibas, Leonidas and Malik, Jitendra},
  booktitle = {Proceedings of the Conference on Robot Learning},
  pages     = {791--812},
  year      = {2020},
  editor    = {Leslie Pack Kaelbling and Danica Kragic and Komei Sugiura},
  volume    = {100},
  series    = {Proceedings of Machine Learning Research},
  address   = {},
  month     = {30 Oct--01 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v100/sax20a/sax20a.pdf},
  url       = {http://proceedings.mlr.press/v100/sax20a.html},
  abstract  = {How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1). This skill set (“mid-level vision”) provides the policy with a more processed state of the world compared to raw images. Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance, when compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we found that no single visual representation was universally useful for all downstream tasks, hence we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.}
}
Endnote
%0 Conference Paper
%T Learning to Navigate Using Mid-Level Visual Priors
%A Alexander Sax
%A Jeffrey O. Zhang
%A Bradley Emi
%A Amir Zamir
%A Silvio Savarese
%A Leonidas Guibas
%A Jitendra Malik
%B Proceedings of the Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Leslie Pack Kaelbling
%E Danica Kragic
%E Komei Sugiura
%F pmlr-v100-sax20a
%I PMLR
%J Proceedings of Machine Learning Research
%P 791--812
%U http://proceedings.mlr.press
%V 100
%W PMLR
%X How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1). This skill set (“mid-level vision”) provides the policy with a more processed state of the world compared to raw images. Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance, when compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we found that no single visual representation was universally useful for all downstream tasks, hence we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.
APA
Sax, A., Zhang, J. O., Emi, B., Zamir, A., Savarese, S., Guibas, L., & Malik, J. (2020). Learning to Navigate Using Mid-Level Visual Priors. Proceedings of the Conference on Robot Learning, in PMLR 100:791-812.
