End-to-End Learning of Semantic Grasping

Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, Sergey Levine
Proceedings of the 1st Annual Conference on Robot Learning, PMLR 78:119-132, 2017.

Abstract

We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A “ventral stream” recognizes object class while a “dorsal stream” simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.
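To make the two-stream idea concrete, below is a minimal PyTorch sketch of a jointly trained model in the spirit of the abstract: a shared image encoder feeds a "dorsal" head that scores a candidate grasp action and a "ventral" head that predicts the object class, with both losses optimized together. The layer sizes, the 4-dimensional action parameterization, and the combined loss are placeholder assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical two-stream grasping model (illustrative sketch only).
import torch
import torch.nn as nn


class TwoStreamGraspNet(nn.Module):
    def __init__(self, num_classes: int, action_dim: int = 4):
        super().__init__()
        # Shared monocular-image encoder (placeholder CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Dorsal stream: image features + candidate grasp action -> success logit.
        self.dorsal = nn.Sequential(
            nn.Linear(64 + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Ventral stream: image features -> object-class logits.
        self.ventral = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, image, action):
        feats = self.encoder(image)
        grasp_logit = self.dorsal(torch.cat([feats, action], dim=-1))
        class_logits = self.ventral(feats)
        return grasp_logit, class_logits


def training_step(model, batch, optimizer):
    # Grasp-success labels can come from self-supervised robot trials;
    # class labels from (semi-supervised) human annotation.
    image, action, grasp_success, obj_class = batch
    grasp_logit, class_logits = model(image, action)
    loss = nn.functional.binary_cross_entropy_with_logits(
        grasp_logit.squeeze(-1), grasp_success.float()
    ) + nn.functional.cross_entropy(class_logits, obj_class)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the end-to-end aspect is simply that both heads share the encoder and are trained with a single summed loss; the paper's actual network, losses, and training procedure may differ.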

Cite this Paper


BibTeX
@InProceedings{pmlr-v78-jang17a,
  title     = {End-to-End Learning of Semantic Grasping},
  author    = {Eric Jang and Sudheendra Vijayanarasimhan and Peter Pastor and Julian Ibarz and Sergey Levine},
  booktitle = {Proceedings of the 1st Annual Conference on Robot Learning},
  pages     = {119--132},
  year      = {2017},
  editor    = {Sergey Levine and Vincent Vanhoucke and Ken Goldberg},
  volume    = {78},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v78/jang17a/jang17a.pdf},
  url       = {http://proceedings.mlr.press/v78/jang17a.html},
  abstract  = {We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A “ventral stream” recognizes object class while a “dorsal stream” simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.}
}
Endnote
%0 Conference Paper
%T End-to-End Learning of Semantic Grasping
%A Eric Jang
%A Sudheendra Vijayanarasimhan
%A Peter Pastor
%A Julian Ibarz
%A Sergey Levine
%B Proceedings of the 1st Annual Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Sergey Levine
%E Vincent Vanhoucke
%E Ken Goldberg
%F pmlr-v78-jang17a
%I PMLR
%J Proceedings of Machine Learning Research
%P 119--132
%U http://proceedings.mlr.press
%V 78
%W PMLR
%X We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A “ventral stream” recognizes object class while a “dorsal stream” simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.
APA
Jang, E., Vijayanarasimhan, S., Pastor, P., Ibarz, J. & Levine, S. (2017). End-to-End Learning of Semantic Grasping. Proceedings of the 1st Annual Conference on Robot Learning, in PMLR 78:119-132.