Zero-shot Video Moment Retrieval With Off-the-Shelf Models

Anuj Diwan, Puyuan Peng, Ray Mooney
Proceedings of The 1st Transfer Learning for Natural Language Processing Workshop, PMLR 203:10-21, 2023.

Abstract

For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a zero-shot simple approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised by over 74%. Further, we also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics; and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
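The three-step pipeline the abstract describes (moment proposal, moment-query matching, postprocessing) can be sketched as follows. This is a minimal illustration, not the paper's implementation: uniform sliding-window proposals, cosine similarity against hypothetical frame/query embeddings, and top-k selection are all stand-in assumptions for the off-the-shelf models the paper actually uses.

```python
def propose_moments(video_len_s, window_s=10, stride_s=5):
    """Step 1 (moment proposal): uniform sliding windows over the video.
    An illustrative stand-in for the paper's off-the-shelf proposal step."""
    return [(start, min(start + window_s, video_len_s))
            for start in range(0, video_len_s, stride_s)]

def match_score(moment, query_emb, frame_embs):
    """Step 2 (moment-query matching): mean cosine similarity between a
    query embedding and the frame embeddings inside the moment.
    Embeddings are hypothetical; a real system would obtain them from an
    off-the-shelf vision-language model."""
    start, end = moment
    frames = frame_embs[start:end]

    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    return sum(cos(query_emb, f) for f in frames) / max(len(frames), 1)

def retrieve(video_len_s, query_emb, frame_embs, top_k=3):
    """Step 3 (postprocessing): rank proposals by score, keep the top k."""
    moments = propose_moments(video_len_s)
    ranked = sorted(moments,
                    key=lambda m: match_score(m, query_emb, frame_embs),
                    reverse=True)
    return ranked[:top_k]
```

For example, on a 60-second video whose frames between seconds 20 and 30 match the query embedding, `retrieve(60, query_emb, frame_embs, top_k=1)` returns the window `(20, 30)` under this sketch.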

Cite this Paper


BibTeX
@InProceedings{pmlr-v203-diwan23a,
  title     = {Zero-shot Video Moment Retrieval With Off-the-Shelf Models},
  author    = {Diwan, Anuj and Peng, Puyuan and Mooney, Ray},
  booktitle = {Proceedings of The 1st Transfer Learning for Natural Language Processing Workshop},
  pages     = {10--21},
  year      = {2023},
  editor    = {Albalak, Alon and Zhou, Chunting and Raffel, Colin and Ramachandran, Deepak and Ruder, Sebastian and Ma, Xuezhe},
  volume    = {203},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v203/diwan23a/diwan23a.pdf},
  url       = {https://proceedings.mlr.press/v203/diwan23a.html},
  abstract  = {For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a zero-shot simple approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised by over 74%. Further, we also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics; and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.}
}
Endnote
%0 Conference Paper
%T Zero-shot Video Moment Retrieval With Off-the-Shelf Models
%A Anuj Diwan
%A Puyuan Peng
%A Ray Mooney
%B Proceedings of The 1st Transfer Learning for Natural Language Processing Workshop
%C Proceedings of Machine Learning Research
%D 2023
%E Alon Albalak
%E Chunting Zhou
%E Colin Raffel
%E Deepak Ramachandran
%E Sebastian Ruder
%E Xuezhe Ma
%F pmlr-v203-diwan23a
%I PMLR
%P 10--21
%U https://proceedings.mlr.press/v203/diwan23a.html
%V 203
%X For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a zero-shot simple approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised by over 74%. Further, we also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics; and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
APA
Diwan, A., Peng, P. & Mooney, R. (2023). Zero-shot Video Moment Retrieval With Off-the-Shelf Models. Proceedings of The 1st Transfer Learning for Natural Language Processing Workshop, in Proceedings of Machine Learning Research 203:10-21. Available from https://proceedings.mlr.press/v203/diwan23a.html.
