CLOVER: Cross-Layer Orthogonal Vectors Pruning

Fanxu Meng, Pingzhi Tang, Fan Jiang, Muhan Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:43807-43825, 2025.

Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bounded. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values, in turn, guide pruning and further serve as trainable parameters for efficient fine-tuning, ultimately enabling the model to recover its performance to the level before pruning.After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70% of the Q-K head dimension in GPT-2 XL results in a perplexity comparable to that of pruning just 8% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1$\times$ over LLaMA-2-7B.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-meng25d, title = {{CLOVER}: Cross-Layer Orthogonal Vectors Pruning}, author = {Meng, Fanxu and Tang, Pingzhi and Jiang, Fan and Zhang, Muhan}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {43807--43825}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/meng25d/meng25d.pdf}, url = {https://proceedings.mlr.press/v267/meng25d.html}, abstract = {Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bounded. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values, in turn, guide pruning and further serve as trainable parameters for efficient fine-tuning, ultimately enabling the model to recover its performance to the level before pruning.After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70% of the Q-K head dimension in GPT-2 XL results in a perplexity comparable to that of pruning just 8% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1$\times$ over LLaMA-2-7B.} }
Endnote
%0 Conference Paper %T CLOVER: Cross-Layer Orthogonal Vectors Pruning %A Fanxu Meng %A Pingzhi Tang %A Fan Jiang %A Muhan Zhang %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-meng25d %I PMLR %P 43807--43825 %U https://proceedings.mlr.press/v267/meng25d.html %V 267 %X Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bounded. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values, in turn, guide pruning and further serve as trainable parameters for efficient fine-tuning, ultimately enabling the model to recover its performance to the level before pruning.After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70% of the Q-K head dimension in GPT-2 XL results in a perplexity comparable to that of pruning just 8% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1$\times$ over LLaMA-2-7B.
APA
Meng, F., Tang, P., Jiang, F. & Zhang, M.. (2025). CLOVER: Cross-Layer Orthogonal Vectors Pruning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:43807-43825 Available from https://proceedings.mlr.press/v267/meng25d.html.

Related Material