SpotEM: Efficient Video Search for Episodic Memory

Santhosh Kumar Ramakrishnan; Ziad Al-Halah; Kristen Grauman

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:28618-28636, 2023.

Abstract

The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., “where did I leave my purse?”). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% – 25% of the clip features, we preserve 84% – 97% of the original EM model’s accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-ramakrishnan23a,
  title = {{S}pot{EM}: Efficient Video Search for Episodic Memory},
  author = {Ramakrishnan, Santhosh Kumar and Al-Halah, Ziad and Grauman, Kristen},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {28618--28636},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/ramakrishnan23a/ramakrishnan23a.pdf},
  url = {https://proceedings.mlr.press/v202/ramakrishnan23a.html},
  abstract = 	 {The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., “where did I leave my purse?”). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% – 25% of the clip features, we preserve 84% – 97% of the original EM model’s accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem}
}

Endnote

%0 Conference Paper
%T SpotEM: Efficient Video Search for Episodic Memory
%A Santhosh Kumar Ramakrishnan
%A Ziad Al-Halah
%A Kristen Grauman
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ramakrishnan23a
%I PMLR
%P 28618--28636
%U https://proceedings.mlr.press/v202/ramakrishnan23a.html
%V 202
%X The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., “where did I leave my purse?”). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% – 25% of the clip features, we preserve 84% – 97% of the original EM model’s accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem

APA

Ramakrishnan, S.K., Al-Halah, Z. & Grauman, K. (2023). SpotEM: Efficient Video Search for Episodic Memory. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:28618-28636. Available from https://proceedings.mlr.press/v202/ramakrishnan23a.html.
