The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:64859-64879, 2025.

Abstract

Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear sharpness disparity across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel Blockwise Learning Rate (LR) strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
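To make the Blockwise LR idea concrete, below is a minimal sketch of how a blockwise learning-rate scheme could be layered on top of AdamW using PyTorch parameter groups. This is not the authors' released implementation: the function name build_blockwise_adamw, the name-matching heuristics used to partition parameters into blocks, and the per-block scaling factors are illustrative placeholders, since the abstract does not state the paper's prescribed ratios (which are derived from the measured blockwise sharpness).

import torch

def build_blockwise_adamw(model, base_lr=3e-4, lr_scale=None, weight_decay=0.1):
    # lr_scale maps block type -> multiplier on base_lr; the values below are
    # placeholders, not the ratios prescribed in the paper.
    if lr_scale is None:
        lr_scale = {"embed": 1.0, "norm": 1.0, "attn": 1.0, "mlp": 1.0}
    groups = {k: [] for k in lr_scale}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        n = name.lower()
        # Partition parameters by block type using simple name heuristics
        # (matching common GPT-2 / LLaMA naming conventions).
        if "wte" in n or "wpe" in n or "embed" in n:
            groups["embed"].append(p)
        elif "ln" in n or "norm" in n:
            groups["norm"].append(p)
        elif "attn" in n or "attention" in n:
            groups["attn"].append(p)
        else:  # MLP / output projections and anything unmatched
            groups["mlp"].append(p)
    # One optimizer parameter group per block type, each with its own LR.
    param_groups = [{"params": ps, "lr": base_lr * lr_scale[k]}
                    for k, ps in groups.items() if ps]
    return torch.optim.AdamW(param_groups, betas=(0.9, 0.95),
                             weight_decay=weight_decay)

Usage would look like optimizer = build_blockwise_adamw(gpt2_model, base_lr=6e-4); a standard LR schedule (e.g. cosine decay) then rescales every group's learning rate jointly, so the blockwise ratios are preserved throughout training.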

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25dl,
  title     = {The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training},
  author    = {Wang, Jinbo and Wang, Mingze and Zhou, Zhanpeng and Yan, Junchi and E, Weinan and Wu, Lei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {64859--64879},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25dl/wang25dl.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25dl.html}
}
Endnote
%0 Conference Paper
%T The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
%A Jinbo Wang
%A Mingze Wang
%A Zhanpeng Zhou
%A Junchi Yan
%A Weinan E
%A Lei Wu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25dl
%I PMLR
%P 64859--64879
%U https://proceedings.mlr.press/v267/wang25dl.html
%V 267
APA
Wang, J., Wang, M., Zhou, Z., Yan, J., E, W., & Wu, L. (2025). The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:64859-64879. Available from https://proceedings.mlr.press/v267/wang25dl.html.
