What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

Xingwu Chen, Difan Zou
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:7972-8001, 2024.

Abstract

We study the capabilities of the transformer architecture with varying depth. Specifically, we design a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization. We show that a transformer with only one attention layer can excel at memorization but falls short on the other tasks. We then show that exhibiting reasoning and generalization ability requires the transformer to have at least two attention layers, while contextual generalization may necessitate three attention layers. Additionally, we identify a class of simple operations that a single attention layer can execute, and show that the complex tasks can be expressed as combinations of these simple operations and thus can be solved by stacking multiple attention layers. This sheds light on studying more practical and complex tasks beyond our design. Numerical experiments corroborate our theoretical findings.
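
As a rough illustration of the varying-depth setting described above, below is a minimal PyTorch-style sketch of an attention-only transformer whose depth (number of stacked attention layers) is a single hyperparameter. The class name, dimensions, masking, and readout are illustrative assumptions for exposition, not the paper's actual construction.

import torch
import torch.nn as nn

class AttentionOnlyTransformer(nn.Module):
    """Illustrative sketch (not the paper's construction): a stack of `depth`
    attention layers, where depth is the quantity being varied (e.g. 1, 2, or 3).
    Positional encodings and MLP blocks are omitted for brevity."""
    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 1, depth: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(depth)]
        )
        self.readout = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)  # (batch, seq_len, d_model)
        seq_len = tokens.size(1)
        # Causal mask: True entries mark positions a query may NOT attend to.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        for attn in self.layers:
            out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
            x = x + out  # residual connection
        return self.readout(x[:, -1])  # predict from the last position

# Example: a depth-2 model, the minimum depth the paper argues is needed
# for the reasoning and generalization tasks.
model = AttentionOnlyTransformer(vocab_size=32, depth=2)
logits = model(torch.randint(0, 32, (4, 10)))  # shape (4, 32)

Varying the depth argument between 1 and 3 reproduces the kind of architectural comparison the abstract describes, with all other components held fixed.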

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-chen24bp,
  title     = {What Can Transformer Learn with Varying Depth? {C}ase Studies on Sequence Learning Tasks},
  author    = {Chen, Xingwu and Zou, Difan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {7972--8001},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chen24bp/chen24bp.pdf},
  url       = {https://proceedings.mlr.press/v235/chen24bp.html}
}
Endnote
%0 Conference Paper
%T What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks
%A Xingwu Chen
%A Difan Zou
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chen24bp
%I PMLR
%P 7972--8001
%U https://proceedings.mlr.press/v235/chen24bp.html
%V 235
APA
Chen, X. & Zou, D. (2024). What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:7972-8001. Available from https://proceedings.mlr.press/v235/chen24bp.html.
