Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:1416-1429, 2023.

Abstract

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
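The three efficiency ideas in the abstract (the student never encodes masked tokens, a lightweight convolutional decoder predicts targets at masked positions, and one teacher forward pass is amortized over several masked versions of the sample) can be illustrated with a toy sketch. This is a minimal numpy illustration of the training loop's shape, not the authors' implementation; `encoder`, `conv_decoder`, the mask ratio, and the number of masked versions are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    # toy "contextualizing" encoder: a linear map plus mean-pooled context,
    # standing in for the Transformer in data2vec 2.0
    h = x @ w
    return h + h.mean(axis=0, keepdims=True)

def conv_decoder(h, k):
    # toy width-3 1-D convolution with same padding, standing in for the
    # fast convolutional decoder
    pad = np.pad(h, ((1, 1), (0, 0)))
    return (pad[:-2] + pad[1:-1] + pad[2:]) * k

T, D = 16, 8                            # tokens per sample, feature dim
x = rng.standard_normal((T, D))
w_student = rng.standard_normal((D, D)) * 0.1
w_teacher = w_student.copy()            # teacher weights track the student (EMA)

# 1) the teacher encodes the FULL input once; its outputs are the
#    contextualized target representations
targets = encoder(x, w_teacher)

# 2) amortize that teacher pass: reuse the same targets for several
#    differently masked versions of the sample
num_masks = 4
total_loss = 0.0
for _ in range(num_masks):
    mask = rng.random(T) < 0.5          # positions to mask (illustrative ratio)
    visible = x[~mask]                  # masked tokens are NOT fed to the encoder
    h_vis = encoder(visible, w_student)
    h_full = np.zeros((T, D))
    h_full[~mask] = h_vis               # masked slots stay zero for the decoder
    pred = conv_decoder(h_full, 0.5)    # cheap decoder fills in masked positions
    # regress onto teacher targets at the masked positions only
    total_loss += np.mean((pred[mask] - targets[mask]) ** 2)

avg_loss = total_loss / num_masks
print(avg_loss)
```

In the real model the student gradients would update `w_student` and the teacher would follow as an exponential moving average; the point of the sketch is only that the expensive teacher pass runs once per sample while the cheaper student pass runs once per mask.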

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-baevski23a,
  title     = {Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language},
  author    = {Baevski, Alexei and Babu, Arun and Hsu, Wei-Ning and Auli, Michael},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {1416--1429},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/baevski23a/baevski23a.pdf},
  url       = {https://proceedings.mlr.press/v202/baevski23a.html},
  abstract  = {Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.}
}
Endnote
%0 Conference Paper
%T Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
%A Alexei Baevski
%A Arun Babu
%A Wei-Ning Hsu
%A Michael Auli
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-baevski23a
%I PMLR
%P 1416--1429
%U https://proceedings.mlr.press/v202/baevski23a.html
%V 202
%X Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
APA
Baevski, A., Babu, A., Hsu, W.-N. &amp; Auli, M. (2023). Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:1416-1429. Available from https://proceedings.mlr.press/v202/baevski23a.html.

Related Material