STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment

Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:27094-27117, 2024.

Abstract

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting, which erases previously learned audio-video relations. To tackle these challenges, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with those from past steps, identifying the patches that remain highly correlated with previously learned data. Based on these two scores, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a $3.69\%$p relative performance gain in zero-shot retrieval tasks over strong continual learning baselines, while reducing memory consumption by $\sim 45\%$.
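The abstract only outlines the pipeline, so the following is a minimal, hypothetical PyTorch sketch of how the two scores might combine into probabilistic patch selection. Every function name, the attention-based importance proxy, the nearest-neighbor correlation to a replay memory, and the blending weight alpha are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

# Hypothetical shapes: `video`/`audio` are patch embeddings from the two
# encoders; `memory` holds replayed patch features from past pre-training steps.
def importance_scores(video, audio):
    # (1) Localized importance proxy: a patch matters if it attends strongly
    # to the other modality (a stand-in for the paper's multimodal encoder).
    attn = torch.softmax(video @ audio.T / video.shape[-1] ** 0.5, dim=-1)
    return attn.max(dim=-1).values  # (N_video,)

def past_correlation(video, memory):
    # (2) Replay-guided correlation proxy: cosine similarity of each current
    # patch to its nearest replayed feature from earlier steps.
    sim = F.normalize(video, dim=-1) @ F.normalize(memory, dim=-1).T
    return sim.max(dim=-1).values  # (N_video,)

def select_patches(video, audio, memory, keep_ratio=0.5, alpha=0.5):
    # Blend the two cues; alpha trades off current cross-modal relevance
    # against stability with respect to past steps.
    score = alpha * importance_scores(video, audio) \
        + (1 - alpha) * past_correlation(video, memory)
    probs = torch.softmax(score, dim=0)
    k = max(1, int(keep_ratio * probs.numel()))
    # Sampling without replacement (rather than top-k) keeps the selection
    # probabilistic, as described in the abstract.
    return torch.multinomial(probs, k, replacement=False)

# Toy usage: 16 video patches, 8 audio patches, 32 replayed features, dim 64.
v, a, m = torch.randn(16, 64), torch.randn(8, 64), torch.randn(32, 64)
print(select_patches(v, a, m))

Sampling from a softmax over blended scores is one plausible reading of "probabilistic patch selection"; the paper itself should be consulted for the exact scoring and selection rules.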

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-lee24ac,
  title     = {{STELLA}: Continual Audio-Video Pre-training with {S}patio{T}emporal Localized Alignment},
  author    = {Lee, Jaewoo and Yoon, Jaehong and Kim, Wonjae and Kim, Yunji and Hwang, Sung Ju},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {27094--27117},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/lee24ac/lee24ac.pdf},
  url       = {https://proceedings.mlr.press/v235/lee24ac.html},
  abstract  = {Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting, which erases previously learned audio-video relations. To tackle these challenges, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with those from past steps, identifying the patches that remain highly correlated with previously learned data. Based on these two scores, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a $3.69\%$p relative performance gain in zero-shot retrieval tasks over strong continual learning baselines, while reducing memory consumption by $\sim 45\%$.}
}
Endnote
%0 Conference Paper
%T STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
%A Jaewoo Lee
%A Jaehong Yoon
%A Wonjae Kim
%A Yunji Kim
%A Sung Ju Hwang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-lee24ac
%I PMLR
%P 27094--27117
%U https://proceedings.mlr.press/v235/lee24ac.html
%V 235
%X Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting, which erases previously learned audio-video relations. To tackle these challenges, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with those from past steps, identifying the patches that remain highly correlated with previously learned data. Based on these two scores, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a $3.69\%$p relative performance gain in zero-shot retrieval tasks over strong continual learning baselines, while reducing memory consumption by $\sim 45\%$.
APA
Lee, J., Yoon, J., Kim, W., Kim, Y. & Hwang, S.J. (2024). STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:27094-27117. Available from https://proceedings.mlr.press/v235/lee24ac.html.
