Mimicking Better by Matching the Approximate Action Distribution

Joao Candido Ramos, Lionel Blondé, Naoya Takeishi, Alexandros Kalousis
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:5513-5532, 2024.

Abstract

In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the non-availability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert’s state-state transitions; we regularize the imitator’s policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.
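A minimal sketch of the alignment idea described above, assuming diagonal-Gaussian heads for both the imitator's policy and the inverse dynamics model: the policy is regularized by a KL-divergence penalty toward the action distribution the inverse dynamics model infers from an expert state transition. All names, shapes, and weights here are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) ) for diagonal Gaussians."""
    return np.sum(
        np.log(std_q / std_p)
        + (std_p**2 + (mu_p - mu_q)**2) / (2.0 * std_q**2)
        - 0.5
    )

# Hypothetical 3-dimensional action space.
# Imitator policy's action distribution at a state s.
mu_pi, std_pi = np.array([0.1, -0.2, 0.0]), np.array([0.5, 0.5, 0.5])
# Action distribution inferred by the inverse dynamics model
# from the expert transition (s, s').
mu_idm, std_idm = np.array([0.0, 0.0, 0.0]), np.array([0.4, 0.4, 0.4])

# Regularized objective: a surrogate reward term (e.g. from an adversarial
# discriminator) minus a weighted KL alignment penalty.
surrogate_reward = 1.0  # placeholder value
beta = 0.1              # hypothetical regularization weight
objective = surrogate_reward - beta * gaussian_kl(mu_pi, std_pi, mu_idm, std_idm)
```

In this toy setup, the penalty is zero exactly when the policy's action distribution matches the inferred one, so maximizing the objective pulls the imitator toward actions the inverse dynamics model deems plausible for the expert's transitions.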

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-candido-ramos24a, title = {Mimicking Better by Matching the Approximate Action Distribution}, author = {Candido Ramos, Joao and Blond\'{e}, Lionel and Takeishi, Naoya and Kalousis, Alexandros}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {5513--5532}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/candido-ramos24a/candido-ramos24a.pdf}, url = {https://proceedings.mlr.press/v235/candido-ramos24a.html}, abstract = {In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the non-availability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert’s state-state transitions; we regularize the imitator’s policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.} }
Endnote
%0 Conference Paper %T Mimicking Better by Matching the Approximate Action Distribution %A Joao Candido Ramos %A Lionel Blondé %A Naoya Takeishi %A Alexandros Kalousis %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-candido-ramos24a %I PMLR %P 5513--5532 %U https://proceedings.mlr.press/v235/candido-ramos24a.html %V 235 %X In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the non-availability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert’s state-state transitions; we regularize the imitator’s policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.
APA
Candido Ramos, J., Blondé, L., Takeishi, N. & Kalousis, A. (2024). Mimicking Better by Matching the Approximate Action Distribution. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:5513-5532. Available from https://proceedings.mlr.press/v235/candido-ramos24a.html.