Multi-Modal Classifiers for Open-Vocabulary Object Detection

Prannay Kaul, Weidi Xie, Andrew Zisserman
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:15946-15969, 2023.

Abstract

The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-kaul23a,
  title = {Multi-Modal Classifiers for Open-Vocabulary Object Detection},
  author = {Kaul, Prannay and Xie, Weidi and Zisserman, Andrew},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {15946--15969},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/kaul23a/kaul23a.pdf},
  url = {https://proceedings.mlr.press/v202/kaul23a.html},
  abstract = {The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.}
}
Endnote
%0 Conference Paper
%T Multi-Modal Classifiers for Open-Vocabulary Object Detection
%A Prannay Kaul
%A Weidi Xie
%A Andrew Zisserman
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-kaul23a
%I PMLR
%P 15946--15969
%U https://proceedings.mlr.press/v202/kaul23a.html
%V 202
%X The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
APA
Kaul, P., Xie, W., & Zisserman, A. (2023). Multi-Modal Classifiers for Open-Vocabulary Object Detection. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:15946-15969. Available from https://proceedings.mlr.press/v202/kaul23a.html.

Related Material