Catformer: Designing Stable Transformers via Sensitivity Analysis

Jared Q Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Re, Chelsea Finn, Percy Liang
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2489-2499, 2021.

Abstract

Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms changes in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab-30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL (the state-of-the-art architecture designed to address stability) by 13%.
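To make the two design points in the abstract concrete, the sketch below contrasts a standard pre-norm residual block with a concatenation-based block in the spirit of the Catformer, plus a crude numerical probe of how much a block's output moves when its parameters are randomly perturbed. This is an illustrative sketch only, assuming PyTorch: the names (ResidualBlock, ConcatBlock, output_shift), layer sizes, and noise scale are our own, the blocks omit attention and other details the paper specifies, and the probe is not the paper's formal definition of sensitivity.

# Illustrative sketch, not the paper's implementation. Names and sizes are
# assumptions for exposition; attention and other architectural details are omitted.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard pre-norm residual update: y = x + f(LayerNorm(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))

class ConcatBlock(nn.Module):
    """Concatenation-based update: y = [x, f(LayerNorm(x))].
    The input and the new features are kept side by side rather than summed,
    so the feature width grows by `growth` at every block."""
    def __init__(self, d_in: int, growth: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_in)
        self.ff = nn.Sequential(
            nn.Linear(d_in, 4 * growth), nn.ReLU(),
            nn.Linear(4 * growth, growth),
        )

    def forward(self, x):
        return torch.cat([x, self.ff(self.norm(x))], dim=-1)

def output_shift(block, x, scale=1e-2):
    """Crude probe (not the paper's formal sensitivity): relative change in the
    output after adding Gaussian noise of the given scale to every parameter."""
    with torch.no_grad():
        y0 = block(x)
        for p in block.parameters():
            p.add_(scale * torch.randn_like(p))
        y1 = block(x)
    return ((y1 - y0).norm() / y0.norm()).item()

x = torch.randn(2, 16, 64)                   # (batch, sequence, features)
print(ResidualBlock(64)(x).shape)            # torch.Size([2, 16, 64]): width preserved
print(ConcatBlock(64, growth=32)(x).shape)   # torch.Size([2, 16, 96]): width grows
print(output_shift(ResidualBlock(64), x), output_shift(ConcatBlock(64, 32), x))

One consequence of concatenating rather than summing is that, when such blocks are stacked, the feature width grows with depth; managing that growth is part of the design trade-off the paper analyzes alongside sensitivity.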

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-davis21a,
  title     = {Catformer: Designing Stable Transformers via Sensitivity Analysis},
  author    = {Davis, Jared Q and Gu, Albert and Choromanski, Krzysztof and Dao, Tri and Re, Christopher and Finn, Chelsea and Liang, Percy},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {2489--2499},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/davis21a/davis21a.pdf},
  url       = {https://proceedings.mlr.press/v139/davis21a.html},
  abstract  = {Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms changes in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab-30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL (the state-of-the-art architecture designed to address stability) by 13%.}
}
Endnote
%0 Conference Paper
%T Catformer: Designing Stable Transformers via Sensitivity Analysis
%A Jared Q Davis
%A Albert Gu
%A Krzysztof Choromanski
%A Tri Dao
%A Christopher Re
%A Chelsea Finn
%A Percy Liang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-davis21a
%I PMLR
%P 2489--2499
%U https://proceedings.mlr.press/v139/davis21a.html
%V 139
%X Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms changes in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab-30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL (the state-of-the-art architecture designed to address stability) by 13%.
APA
Davis, J.Q., Gu, A., Choromanski, K., Dao, T., Re, C., Finn, C. & Liang, P. (2021). Catformer: Designing Stable Transformers via Sensitivity Analysis. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:2489-2499. Available from https://proceedings.mlr.press/v139/davis21a.html.
