A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1381-1452, 2025.
Abstract
Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on providing a provable guarantee for the one-layer attention network objective function: given input matrices of a layer, $A_1, A_2, A_3 \in \mathbb{R}^{n \times d}$, our goal is to minimize the loss function: \begin{align*} L(X,Y) = \sum_{j_0=1}^n \sum_{i_0=1}^d \left( \left\langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \cdot \exp( \mathsf{A}_{j_0} x ), \; A_{3} Y_{*,i_0} \right\rangle - b_{j_0,i_0} \right)^2, \end{align*} where $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of the Kronecker product of $A_1$ and $A_2$. The matrix $B \in \mathbb{R}^{n \times d}$ represents the output of a layer, and $b_{j_0,i_0} \in \mathbb{R}$ is the $(j_0, i_0)$-th entry of $B$. $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In self-attention, $Q, K, V \in \mathbb{R}^{d \times d}$ represent the weight matrices for the query, key, and value, respectively. Unlike prior works that relied on simplified and single-variable versions of the attention computation problem, our multivariate loss function analyzes a complete and unsimplified attention layer, treating all these weights as variables, where $X = QK^\top \in \mathbb{R}^{d \times d}$ and $Y = V \in \mathbb{R}^{d \times d}$. We propose an iterative greedy algorithm to train a neural network using the loss function $L(X,Y)$, achieving an error bound of $\epsilon \in (0, 0.1)$. The algorithm runs in $O( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time, where ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time complexity of multiplying an $a \times b$ matrix with a $b \times c$ matrix, and $\omega \approx 2.37$ is the exponent of matrix multiplication.
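To make the objective concrete, the following is a minimal NumPy sketch of evaluating $L(X,Y)$, assuming the row-wise Kronecker convention common in this line of work: the $i$-th row of $\mathsf{A}_{j_0}$ is the Kronecker product of the $j_0$-th row of $A_1$ with the $i$-th row of $A_2$, so that $\exp(\mathsf{A}_{j_0} x)$ is the $j_0$-th row of $\exp(A_1 X A_2^\top)$ and the loss reduces to the entrywise squared error of a one-layer attention output against $B$. The function name `attention_loss` and the shapes in the usage snippet are illustrative, not taken from the paper, and this sketch does not implement the paper's iterative greedy training algorithm.

```python
import numpy as np

def attention_loss(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) under the assumed Kronecker-block convention:
    exp(A_{j0} x) is the j0-th row of exp(A1 @ X @ A2.T)."""
    # Unnormalized attention scores: S[j0, i] = exp(<A_{j0} row i, vec(X)>).
    S = np.exp(A1 @ X @ A2.T)                        # shape (n, n)
    # Row-wise softmax: <exp(A_{j0} x), 1_n>^{-1} * exp(A_{j0} x).
    softmax_rows = S / S.sum(axis=1, keepdims=True)  # shape (n, n)
    # One-layer attention output compared entrywise against B.
    attn_out = softmax_rows @ (A3 @ Y)               # shape (n, d)
    return np.sum((attn_out - B) ** 2)

# Illustrative usage with random data (shapes chosen for the demo only).
n, d = 8, 4
rng = np.random.default_rng(0)
A1, A2, A3, B = (rng.standard_normal((n, d)) for _ in range(4))
X, Y = rng.standard_normal((d, d)), rng.standard_normal((d, d))
print(attention_loss(A1, A2, A3, B, X, Y))
```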