On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Nghiem Tuong Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13713-13745, 2025.

Abstract

LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
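The zero-initialized attention the abstract refers to is the LLaMA-Adapter mechanism in which learnable prompt tokens attend alongside the input tokens and their contribution is scaled by a gating factor initialized to zero, so training starts from the behavior of vanilla attention. The following is a minimal, single-head PyTorch sketch of that idea; the module name ZeroInitAttention, the projection layout, and the prompt initialization are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitAttention(nn.Module):
    """Simplified single-head sketch of zero-initialized attention.

    Learnable prompt tokens are attended to alongside the input tokens,
    and the prompt attention is scaled by a gating factor initialized to
    zero, so at initialization the module reproduces vanilla attention.
    """

    def __init__(self, dim: int, num_prompts: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)  # adaptation prompt (illustrative init)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gating factor
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        q = self.q_proj(x)                                    # (b, n, d)
        k_tok, v_tok = self.k_proj(x), self.v_proj(x)         # token keys/values
        prompt = self.prompt.unsqueeze(0).expand(b, -1, -1)   # (b, p, d)
        k_pr, v_pr = self.k_proj(prompt), self.v_proj(prompt)

        scale = d ** -0.5
        s_tok = q @ k_tok.transpose(-2, -1) * scale           # (b, n, n)
        s_pr = q @ k_pr.transpose(-2, -1) * scale             # (b, n, p)

        # Softmax is applied separately to the token and prompt score blocks;
        # the prompt block is then multiplied by the gate, which starts at zero,
        # so the initial output equals vanilla attention over x.
        attn_tok = F.softmax(s_tok, dim=-1)
        attn_pr = F.softmax(s_pr, dim=-1) * self.gate

        return attn_tok @ v_tok + attn_pr @ v_pr

# Usage (shapes only): out has the same shape as x, and at initialization
# the prompt contributes nothing because the gate is zero.
x = torch.randn(2, 16, 64)
out = ZeroInitAttention(dim=64, num_prompts=4)(x)  # (2, 16, 64)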

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-diep25a,
  title     = {On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation},
  author    = {Diep, Nghiem Tuong and Nguyen, Huy and Nguyen, Chau and Le, Minh and Nguyen, Duy Minh Ho and Sonntag, Daniel and Niepert, Mathias and Ho, Nhat},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {13713--13745},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/diep25a/diep25a.pdf},
  url       = {https://proceedings.mlr.press/v267/diep25a.html},
  abstract  = {LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.}
}
Endnote
%0 Conference Paper
%T On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
%A Nghiem Tuong Diep
%A Huy Nguyen
%A Chau Nguyen
%A Minh Le
%A Duy Minh Ho Nguyen
%A Daniel Sonntag
%A Mathias Niepert
%A Nhat Ho
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-diep25a
%I PMLR
%P 13713--13745
%U https://proceedings.mlr.press/v267/diep25a.html
%V 267
%X LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
APA
Diep, N.T., Nguyen, H., Nguyen, C., Le, M., Nguyen, D.M.H., Sonntag, D., Niepert, M. & Ho, N. (2025). On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13713-13745. Available from https://proceedings.mlr.press/v267/diep25a.html.
