Outlier-aware Slicing for Post-Training Quantization in Vision Transformer
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:33811-33825, 2024.
Abstract
Post-Training Quantization (PTQ) is a vital technique for network compression and acceleration, gaining prominence as model sizes increase. This paper addresses a critical challenge in PTQ: the severe impact of outliers on the accuracy of quantized transformer architectures. Specifically, we introduce the concept of 'reconstruction granularity' as a novel solution to this issue, one that previous work has overlooked. Our work provides theoretical insights into the role of reconstruction granularity in mitigating the outlier problem in transformer models. This theoretical framework is supported by empirical analysis, which demonstrates that the choice of reconstruction granularity significantly influences quantization performance. Our findings indicate that different architectural designs require different optimal reconstruction granularities. For instance, the multi-stage Swin Transformer architecture benefits from finer granularity, a deviation from the trends observed in ViT and DeiT models. We further develop an algorithm for determining the optimal reconstruction granularity for various ViT models, achieving state-of-the-art (SOTA) performance in PTQ. For example, with $4$-bit quantization, our method brings the Swin-Base model to a Top-1 accuracy of $82.24\%$ on the ImageNet classification task, surpassing RepQ-ViT by $3.92\%$ ($82.24\%$ vs. $78.32\%$). Similarly, our approach raises ViT-Small to a Top-1 accuracy of $80.50\%$, outperforming NoisyQuant by $3.64\%$ ($80.50\%$ vs. $76.86\%$). Code is available in the Supplementary Materials.
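To make the outlier problem mentioned above concrete, the following minimal sketch shows how a single large value can degrade plain min-max uniform $4$-bit quantization. It is an illustration only and does not implement the paper's reconstruction-granularity algorithm; the function name `quantize_uniform`, the synthetic Gaussian "activations", and the outlier value of 60 are all assumptions chosen for the example.

```python
import numpy as np


def quantize_uniform(x, n_bits=4):
    """Min-max uniform (asymmetric) quantization followed by dequantization.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    levels = 2 ** n_bits - 1                      # 15 steps for 4 bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo                         # map back to real values


rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)            # synthetic activations (assumption)

acts_outlier = acts.copy()
acts_outlier[0] = 60.0                            # inject one large outlier (assumption)

# Quantization error without the outlier.
mse_clean = float(np.mean((acts - quantize_uniform(acts)) ** 2))

# Quantization error on the ordinary entries once the outlier stretches the range.
deq = quantize_uniform(acts_outlier)
mse_outlier = float(np.mean((acts_outlier[1:] - deq[1:]) ** 2))

print(f"MSE without outlier: {mse_clean:.4f}")
print(f"MSE with outlier:    {mse_outlier:.4f}")
```

Because the outlier widens the quantization range, the step size grows and most ordinary values collapse onto only a few levels, so the error on those values rises sharply. This is the effect that outlier-aware PTQ methods aim to mitigate.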