Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Mingyu Ding; Yan Xu; Zhenfang Chen; David Daniel Cox; Ping Luo; Joshua B. Tenenbaum; Chuang Gan

Proceedings of The 6th Conference on Robot Learning, PMLR 205:1743-1754, 2023.

Abstract

Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and generalize their compositions to complete tasks described in natural language in novel scenes. To mimic this capability, we propose the Embodied Concept Learner (ECL) in an interactive 3D environment. Specifically, a robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks by learning purely from human demonstrations and language instructions, without access to ground-truth semantic and depth supervision from simulations. ECL consists of: (i) an instruction parser that translates natural language into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program. ECL has several appealing benefits thanks to its modular design. First, it enables the robotic agent to learn semantics and depth without supervision, much as infants do, e.g., grounding concepts through active interaction and perceiving depth from disparities when moving forward. Second, ECL is fully transparent and step-by-step interpretable in long-term planning. Third, ECL benefits embodied instruction following (EIF), outperforming previous works on the ALFRED benchmark when the semantic label is not provided. In addition, the learned concepts can be reused for other downstream tasks, such as reasoning about object states.

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-ding23b,
  title = 	 {Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following},
  author =       {Ding, Mingyu and Xu, Yan and Chen, Zhenfang and Cox, David Daniel and Luo, Ping and Tenenbaum, Joshua B. and Gan, Chuang},
  booktitle = 	 {Proceedings of The 6th Conference on Robot Learning},
  pages = 	 {1743--1754},
  year = 	 {2023},
  editor = 	 {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume = 	 {205},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--18 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v205/ding23b/ding23b.pdf},
  url = 	 {https://proceedings.mlr.press/v205/ding23b.html},
  abstract = 	 {Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and generalize their compositions to complete tasks described by natural languages in novel scenes. To mimic such capability, we propose Embodied Concept Learner (ECL) in an interactive 3D environment. Specifically, a robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks by learning purely from human demonstrations and language instructions, without access to ground-truth semantic and depth supervision from simulations. ECL consists of: (i) an instruction parser that translates the natural languages into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program. ECL has several appealing benefits thanks to its modularized design. Firstly, it enables the robotic agent to learn semantics and depth unsupervisedly acting like babies, e.g., ground concepts through active interaction and perceive depth by disparities when moving forward. Secondly, ECL is fully transparent and step-by-step interpretable in long-term planning. Thirdly, ECL could be beneficial for the embodied instruction following (EIF), outperforming previous works on the ALFRED benchmark when the semantic label is not provided. Also, the learned concept can be reused for other downstream tasks, such as reasoning of object states.}
}

Endnote

%0 Conference Paper
%T Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following
%A Mingyu Ding
%A Yan Xu
%A Zhenfang Chen
%A David Daniel Cox
%A Ping Luo
%A Joshua B. Tenenbaum
%A Chuang Gan
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-ding23b
%I PMLR
%P 1743--1754
%U https://proceedings.mlr.press/v205/ding23b.html
%V 205
%X Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and generalize their compositions to complete tasks described by natural languages in novel scenes. To mimic such capability, we propose Embodied Concept Learner (ECL) in an interactive 3D environment. Specifically, a robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks by learning purely from human demonstrations and language instructions, without access to ground-truth semantic and depth supervision from simulations. ECL consists of: (i) an instruction parser that translates the natural languages into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program. ECL has several appealing benefits thanks to its modularized design. Firstly, it enables the robotic agent to learn semantics and depth unsupervisedly acting like babies, e.g., ground concepts through active interaction and perceive depth by disparities when moving forward. Secondly, ECL is fully transparent and step-by-step interpretable in long-term planning. Thirdly, ECL could be beneficial for the embodied instruction following (EIF), outperforming previous works on the ALFRED benchmark when the semantic label is not provided. Also, the learned concept can be reused for other downstream tasks, such as reasoning of object states.

APA


Ding, M., Xu, Y., Chen, Z., Cox, D.D., Luo, P., Tenenbaum, J.B. & Gan, C. (2023). Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:1743-1754. Available from https://proceedings.mlr.press/v205/ding23b.html.
