Partially Shared Query-Key for Lightweight Language Models

Kai Yang, Vahid Partovi Nia, Boxing Chen, Masoud Asgharian
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:286-291, 2024.

Abstract

Lightweight language models, such as TinyBERT 14.5M, have emerged as a critical area of research because of their suitability for resource-constrained hardware. These transformer models have significantly fewer parameters and reduced memory and computational requirements, which makes them well suited for deployment on small devices. We explore parameter sharing between the key and query weight matrices of a transformer model. Full query-key sharing, which has already been proposed in the literature, yields a fully quadratic attention matrix, oversimplifies directional dependencies, and degrades the pre-training loss. In contrast, partial parameter sharing balances complexity reduction with performance retention. Partial parameter sharing effectively addresses over-fitting while maintaining strong performance even with a high degree of shared parameters, up to 95%. This provides a promising strategy for enhancing language models, specifically targeting small models.
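
To make the idea concrete, below is a minimal PyTorch sketch of partial query-key sharing, assuming the shared fraction is taken column-wise from the query and key projection matrices; the paper's exact partitioning scheme may differ, and the module name PartiallySharedQK is hypothetical.

import torch
import torch.nn as nn


class PartiallySharedQK(nn.Module):
    """Sketch: queries and keys reuse a shared projection for most dimensions."""

    def __init__(self, d_model: int, share_ratio: float = 0.95):
        super().__init__()
        d_shared = int(d_model * share_ratio)   # columns shared by Q and K
        d_private = d_model - d_shared          # columns kept separate
        self.shared = nn.Linear(d_model, d_shared, bias=False)
        self.q_private = nn.Linear(d_model, d_private, bias=False)
        self.k_private = nn.Linear(d_model, d_private, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Q and K differ only in their small private parts, so the attention
        # scores are not forced to be fully symmetric as in full sharing.
        shared = self.shared(x)
        q = torch.cat([shared, self.q_private(x)], dim=-1)
        k = torch.cat([shared, self.k_private(x)], dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return scores.softmax(dim=-1)


# Usage: a 95%-shared projection on a toy batch of 2 sequences of length 16.
attn = PartiallySharedQK(d_model=128, share_ratio=0.95)
probs = attn(torch.randn(2, 16, 128))
print(probs.shape)  # torch.Size([2, 16, 16])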

Cite this Paper


BibTeX
@InProceedings{pmlr-v262-yang24a,
  title     = {Partially Shared Query-Key for Lightweight Language Models},
  author    = {Yang, Kai and Partovi Nia, Vahid and Chen, Boxing and Asgharian, Masoud},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {286--291},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/yang24a/yang24a.pdf},
  url       = {https://proceedings.mlr.press/v262/yang24a.html}
}
Endnote
%0 Conference Paper
%T Partially Shared Query-Key for Lightweight Language Models
%A Kai Yang
%A Vahid Partovi Nia
%A Boxing Chen
%A Masoud Asgharian
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-yang24a
%I PMLR
%P 286--291
%U https://proceedings.mlr.press/v262/yang24a.html
%V 262
APA
Yang, K., Partovi Nia, V., Chen, B. & Asgharian, M. (2024). Partially Shared Query-Key for Lightweight Language Models. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:286-291. Available from https://proceedings.mlr.press/v262/yang24a.html.