Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49008-49028, 2024.
Abstract
The attention scheme is one of the key components of all large language models (LLMs), such as BERT, GPT-1, Transformers, GPT-2, 3, 3.5 and 4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi ICML 2023; Alman and Song NeurIPS 2023], we formally define a dynamic version of the attention matrix multiplication problem. In each iteration we update one entry of the key matrix $K \in \mathbb{R}^{n \times d}$ or the value matrix $V \in \mathbb{R}^{n \times d}$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix, $D := \mathrm{diag}(A \mathbf{1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix, and $\mathbf{1}_n$ denotes the length-$n$ all-ones vector. We provide two results: an algorithm and a conditional lower bound. Inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000; Sankowski FOCS 2004; Cohen, Lee and Song STOC 2019; Brand SODA 2020], we provide a data structure that uses $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time, where $n^{\omega(1,1,\tau)}$ denotes the time to multiply an $n \times n$ matrix by an $n \times n^{\tau}$ matrix, $\omega$ is the matrix multiplication exponent, and $\tau$ is a constant in $(0,1]$. We also show that unless the hinted matrix vector multiplication conjecture [Brand, Nanongkai and Saranurak FOCS 2019] is false, no algorithm can simultaneously achieve $O(n^{\omega(1,1,\tau)-\tau-\Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time.
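For concreteness, the sketch below is a minimal brute-force baseline for the problem as defined above: single-entry updates to $K$ or $V$, and a query that recomputes one entry of $D^{-1} A V$ from scratch in $O(nd)$ time. It only illustrates the problem definition; it is not the paper's lazy-update data structure, and the class and method names (`NaiveDynamicAttention`, `update_K`, `update_V`, `query`) are illustrative, not taken from the paper.

```python
import numpy as np


class NaiveDynamicAttention:
    """Brute-force baseline for dynamic attention maintenance.

    Supports single-entry updates to K or V and answers queries
    (D^{-1} A V)_{i,j}, where A = exp(Q K^T) and D = diag(A 1_n),
    by recomputing the needed row of A on every query.
    """

    def __init__(self, Q, K, V):
        self.Q = np.asarray(Q, dtype=float)  # n x d query matrix (fixed)
        self.K = np.asarray(K, dtype=float)  # n x d key matrix (updatable)
        self.V = np.asarray(V, dtype=float)  # n x d value matrix (updatable)

    def update_K(self, i, j, value):
        # Update one entry of the key matrix K; O(1) time.
        self.K[i, j] = value

    def update_V(self, i, j, value):
        # Update one entry of the value matrix V; O(1) time.
        self.V[i, j] = value

    def query(self, i, j):
        # Recompute row i of A = exp(Q K^T), normalize by its row sum
        # (the i-th diagonal entry of D), and return (D^{-1} A V)_{i,j}.
        # Costs O(nd) per query; no lazy updates are used here.
        a_i = np.exp(self.Q[i] @ self.K.T)  # length-n row of A
        return float(a_i @ self.V[:, j] / a_i.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    Q, K, V = rng.standard_normal((3, n, d))
    ds = NaiveDynamicAttention(Q, K, V)
    ds.update_K(2, 1, 0.5)   # change one key entry
    ds.update_V(5, 3, -1.0)  # change one value entry
    print(ds.query(0, 3))    # one entry of D^{-1} A V
```

The paper's data structure improves on this baseline by batching updates lazily and paying for recomputation with fast rectangular matrix multiplication, which is where the $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ query time trade-off comes from.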