Understanding the Training Speedup from Sampling with Approximate Losses

Rudrajit Das, Xi Chen, Bertram Ieong, Parikshit Bansal, Sujay Sanghavi
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:10127-10147, 2024.

Abstract

It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large approximate losses instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses with an intermediate layer’s representations for sample selection. We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT Base model, and show significant gains (in terms of training hours and number of backpropagation steps) over vanilla training, without any optimized implementation. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes $\sim$ 43 hours compared to $\sim$ 57 hours of vanilla training.
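
To make the selection mechanism concrete, below is a minimal, hypothetical PyTorch sketch of greedy sample selection via approximate losses from an early-exit head, as described at a high level in the abstract. The module names (backbone_layers, exit_head, final_head), the toy architecture, the exit layer, and the keep fraction are illustrative assumptions, not the paper's implementation; in particular, how the auxiliary exit head itself is trained is omitted here.

# Hypothetical sketch of loss-based sample selection via early exiting,
# in the spirit of SIFT as described in the abstract. All names and
# hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, n_layers, n_classes = 64, 4, 5
backbone_layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
exit_head = nn.Linear(d_model, n_classes)   # auxiliary head attached at an early layer
final_head = nn.Linear(d_model, n_classes)  # head used for the actual training loss

exit_layer = 1    # "exit at the first layer"
keep_frac = 0.5   # fraction of each batch kept for backpropagation

opt = torch.optim.Adam(
    list(backbone_layers.parameters())
    + list(exit_head.parameters())
    + list(final_head.parameters()),
    lr=1e-3,
)

def train_step(x, y):
    # Cheap pass: only the first `exit_layer` layers, no gradient tracking,
    # to obtain per-sample approximate losses.
    with torch.no_grad():
        h = x
        for layer in backbone_layers[:exit_layer]:
            h = torch.relu(layer(h))
        approx_loss = F.cross_entropy(exit_head(h), y, reduction="none")

    # Greedy selection: keep the samples with the largest approximate losses.
    k = max(1, int(keep_frac * x.size(0)))
    idx = approx_loss.topk(k).indices

    # Full forward/backward pass only on the selected samples.
    h = x[idx]
    for layer in backbone_layers:
        h = torch.relu(layer(h))
    loss = F.cross_entropy(final_head(h), y[idx])

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage on random data.
x, y = torch.randn(32, d_model), torch.randint(0, n_classes, (32,))
print(train_step(x, y))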

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-das24b,
  title     = {Understanding the Training Speedup from Sampling with Approximate Losses},
  author    = {Das, Rudrajit and Chen, Xi and Ieong, Bertram and Bansal, Parikshit and Sanghavi, Sujay},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {10127--10147},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/das24b/das24b.pdf},
  url       = {https://proceedings.mlr.press/v235/das24b.html}
}
Endnote
%0 Conference Paper
%T Understanding the Training Speedup from Sampling with Approximate Losses
%A Rudrajit Das
%A Xi Chen
%A Bertram Ieong
%A Parikshit Bansal
%A Sujay Sanghavi
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-das24b
%I PMLR
%P 10127--10147
%U https://proceedings.mlr.press/v235/das24b.html
%V 235
APA
Das, R., Chen, X., Ieong, B., Bansal, P. & Sanghavi, S. (2024). Understanding the Training Speedup from Sampling with Approximate Losses. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:10127-10147. Available from https://proceedings.mlr.press/v235/das24b.html.