VIRL: Self-Supervised Visual Graph Inverse Reinforcement Learning

Lei Huang, Weijia Cai, Zihan Zhu, Chen Feng, Helge Rhodin, Zhengbo Zou
Proceedings of The 8th Conference on Robot Learning, PMLR 270:2029-2048, 2025.

Abstract

Learning dense reward functions from unlabeled videos for reinforcement learning scales well because of the vast diversity and quantity of available video. Recent works measure task progress as a reward using either visual features, which deteriorate in unseen domains, or graph abstractions, which capture spatial structure but overlook visual detail. We propose Visual-Graph Inverse Reinforcement Learning (VIRL), a self-supervised method that fuses low-level visual features and high-level graph abstractions of each frame into a single graph representation for reward learning. VIRL uses a visual encoder that extracts object-wise features for graph nodes and a graph encoder that derives properties of graphs constructed from the objects detected in each frame. The encoded representations are trained to align videos temporally and to reconstruct in-scene objects. The pretrained visual-graph encoder is then used to construct a dense reward function for policy learning by measuring the latent distance between the current frame and the goal frame. Our empirical evaluation on the X-MAGICAL and Robot Visual Pusher benchmarks demonstrates that VIRL effectively handles tasks requiring both fine-grained visual attention and broader global features, and generalizes robustly to extrapolation tasks and domains not seen in the demonstrations. Our policy for the robotic task also achieves the highest success rate in real-world robot experiments.
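
To make the reward construction concrete, the following is a minimal sketch of the goal-distance reward described above. It assumes a pretrained encoder callable (here named encode) and an L2 latent distance; the function name, the distance metric, and the encoder interface are illustrative assumptions, not the authors' implementation, since the abstract only states that latent distances between current frames and the goal frame are measured.

import torch

def goal_distance_reward(encode, frame: torch.Tensor, goal_frame: torch.Tensor) -> float:
    """Dense reward: negative latent distance between current and goal frames.

    `encode` stands in for the pretrained visual-graph encoder; the L2
    distance is an assumed choice of metric for this sketch.
    """
    with torch.no_grad():
        z_t = encode(frame)       # embedding of the current observation
        z_g = encode(goal_frame)  # embedding of the goal observation
    return -torch.linalg.vector_norm(z_t - z_g).item()

At each environment step, the policy's current observation is encoded and rewarded by its proximity to the encoded goal frame, yielding a dense signal without any manual reward engineering.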

Cite this Paper

BibTeX
@InProceedings{pmlr-v270-huang25d,
  title     = {VIRL: Self-Supervised Visual Graph Inverse Reinforcement Learning},
  author    = {Huang, Lei and Cai, Weijia and Zhu, Zihan and Feng, Chen and Rhodin, Helge and Zou, Zhengbo},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {2029--2048},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/huang25d/huang25d.pdf},
  url       = {https://proceedings.mlr.press/v270/huang25d.html}
}
APA
Huang, L., Cai, W., Zhu, Z., Feng, C., Rhodin, H., & Zou, Z. (2025). VIRL: Self-Supervised Visual Graph Inverse Reinforcement Learning. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2029-2048. Available from https://proceedings.mlr.press/v270/huang25d.html.