ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49545-49557, 2024.

Abstract

Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-vishniakov24a,
  title     = {{C}onv{N}et vs Transformer, Supervised vs {CLIP}: Beyond {I}mage{N}et Accuracy},
  author    = {Vishniakov, Kirill and Shen, Zhiqiang and Liu, Zhuang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {49545--49557},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/vishniakov24a/vishniakov24a.pdf},
  url       = {https://proceedings.mlr.press/v235/vishniakov24a.html},
  abstract  = {Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models.}
}
Endnote
%0 Conference Paper
%T ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
%A Kirill Vishniakov
%A Zhiqiang Shen
%A Zhuang Liu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-vishniakov24a
%I PMLR
%P 49545--49557
%U https://proceedings.mlr.press/v235/vishniakov24a.html
%V 235
%X Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models.
APA
Vishniakov, K., Shen, Z., & Liu, Z. (2024). ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:49545-49557. Available from https://proceedings.mlr.press/v235/vishniakov24a.html.