Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation

Junyu Gao, Xuan Yao, Changsheng Xu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:14902-14919, 2024.

Abstract

The ability to accurately comprehend natural language instructions and navigate to the target location is essential for an embodied agent. Such agents are typically required to execute user instructions in an online manner, leading us to explore the use of unlabeled test samples for effective online model adaptation. However, for online Vision-and-Language Navigation (VLN), due to the intrinsic nature of inter-sample online instruction execution and intra-sample multi-step action decision, frequent updates can result in drastic changes in model parameters, while occasional updates can make the model ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN by performing joint decomposition-accumulation analysis for both gradients and parameters in a unified framework. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks. Code is available at https://github.com/Feliciaxyao/ICML2024-FSTTA.
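To make the fast-slow idea concrete, below is a minimal sketch of a fast-slow test-time adaptation loop. This is an illustration only, not the paper's algorithm: the entropy-minimization objective, the learning rates, and the `FastSlowAdapter`, `fast_step`, and `slow_step` names are all assumptions made for the example; the paper's actual joint decomposition-accumulation analysis of gradients and parameters is more involved.

```python
import torch


def entropy_loss(logits):
    # A common unsupervised test-time objective (assumed here, not taken
    # from the paper): minimize the entropy of the action distribution.
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()


class FastSlowAdapter:
    """Hypothetical fast-slow update scheme: small 'fast' updates after
    each action decision within an episode, plus a 'slow' consolidation
    of the drifted weights across episodes."""

    def __init__(self, model, fast_lr=1e-4, slow_lr=1e-2):
        self.model = model
        self.fast_lr = fast_lr
        self.slow_lr = slow_lr
        # Slow ("anchor") parameters kept as a detached copy.
        self.slow_params = [p.detach().clone() for p in model.parameters()]

    def fast_step(self, logits):
        # Intra-sample: one small gradient step per action decision.
        loss = entropy_loss(logits)
        grads = torch.autograd.grad(loss, list(self.model.parameters()))
        with torch.no_grad():
            for p, g in zip(self.model.parameters(), grads):
                p -= self.fast_lr * g

    def slow_step(self):
        # Inter-sample: pull the slow params toward the fast params, then
        # restart the next episode from the consolidated weights, which
        # damps drastic parameter drift from frequent fast updates.
        with torch.no_grad():
            for p, s in zip(self.model.parameters(), self.slow_params):
                s += self.slow_lr * (p - s)
                p.copy_(s)
```

In use, `fast_step` would be called after each action decision within an instruction episode and `slow_step` after each completed instruction, mirroring the intra-sample/inter-sample distinction drawn in the abstract.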

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-gao24p,
  title     = {Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation},
  author    = {Gao, Junyu and Yao, Xuan and Xu, Changsheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {14902--14919},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gao24p/gao24p.pdf},
  url       = {https://proceedings.mlr.press/v235/gao24p.html},
  abstract  = {The ability to accurately comprehend natural language instructions and navigate to the target location is essential for an embodied agent. Such agents are typically required to execute user instructions in an online manner, leading us to explore the use of unlabeled test samples for effective online model adaptation. However, for online Vision-and-Language Navigation (VLN), due to the intrinsic nature of inter-sample online instruction execution and intra-sample multi-step action decision, frequent updates can result in drastic changes in model parameters, while occasional updates can make the model ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN by performing joint decomposition-accumulation analysis for both gradients and parameters in a unified framework. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks. Code is available at https://github.com/Feliciaxyao/ICML2024-FSTTA.}
}
Endnote
%0 Conference Paper
%T Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation
%A Junyu Gao
%A Xuan Yao
%A Changsheng Xu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gao24p
%I PMLR
%P 14902--14919
%U https://proceedings.mlr.press/v235/gao24p.html
%V 235
%X The ability to accurately comprehend natural language instructions and navigate to the target location is essential for an embodied agent. Such agents are typically required to execute user instructions in an online manner, leading us to explore the use of unlabeled test samples for effective online model adaptation. However, for online Vision-and-Language Navigation (VLN), due to the intrinsic nature of inter-sample online instruction execution and intra-sample multi-step action decision, frequent updates can result in drastic changes in model parameters, while occasional updates can make the model ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN by performing joint decomposition-accumulation analysis for both gradients and parameters in a unified framework. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks. Code is available at https://github.com/Feliciaxyao/ICML2024-FSTTA.
APA
Gao, J., Yao, X. & Xu, C. (2024). Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:14902-14919. Available from https://proceedings.mlr.press/v235/gao24p.html.
