Learning Associative Memories with Gradient Descent

Vivien Cabannes, Berfin Simsek, Alberto Bietti
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:5114-5134, 2024.

Abstract

This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the “classification margins.” Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. We also find that underparameterized regimes lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.
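The setup the abstract describes can be illustrated with a rough, self-contained sketch (this is not the authors' code, and all sizes, the Zipf-like frequencies, and the learning rate below are illustrative assumptions): a single matrix `W` stores input-to-output token associations, the score of output `y` on input `x` is `u_y^T W e_x`, and `W` is trained by gradient descent on a frequency-weighted cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: N stored associations (input token x -> output
# token y = x), embedding dimension d > N, i.e. an overparameterized regime.
N, d = 16, 128

# Random (near-orthogonal in high dimension) input/output token embeddings.
E = rng.standard_normal((N, d)) / np.sqrt(d)  # input embeddings e_x
U = rng.standard_normal((N, d)) / np.sqrt(d)  # output embeddings u_y

# Zipf-like token frequencies: the imbalance the abstract points to.
p = 1.0 / np.arange(1, N + 1)
p /= p.sum()

W = np.zeros((d, d))  # the associative memory matrix
lr = 5.0              # step size (assumed); larger values amplify oscillations
losses, margins = [], []

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(1000):
    logits = E @ W.T @ U.T                 # logits[x, y] = u_y^T W e_x
    s = softmax(logits)
    # Frequency-weighted cross-entropy on the correct associations.
    losses.append(-(p * np.log(np.diag(s))).sum())
    # Classification margin of x: correct score minus best competing score.
    off = logits.copy()
    np.fill_diagonal(off, -np.inf)
    margins.append((np.diag(logits) - off.max(axis=1)).min())
    # Gradient of the loss: sum_{x,y} p[x] (s[x,y] - 1{y=x}) u_y e_x^T.
    coeff = p[:, None] * (s - np.eye(N))
    W -= lr * (U.T @ coeff.T @ E)

preds = np.argmax(E @ W.T @ U.T, axis=1)
print("accuracy:", (preds == np.arange(N)).mean())
print("min margin:", margins[-1])
```

In this toy run the minimum margin keeps growing slowly after all associations are classified correctly, loosely mirroring the logarithmic margin growth the abstract reports for overparameterized regimes; rare tokens (small `p[x]`) are the last to be stored.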

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-cabannes24a,
  title     = {Learning Associative Memories with Gradient Descent},
  author    = {Cabannes, Vivien and Simsek, Berfin and Bietti, Alberto},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {5114--5134},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/cabannes24a/cabannes24a.pdf},
  url       = {https://proceedings.mlr.press/v235/cabannes24a.html},
  abstract  = {This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the “classification margins.” Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. We also find that underparameterized regimes lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.}
}
Endnote
%0 Conference Paper
%T Learning Associative Memories with Gradient Descent
%A Vivien Cabannes
%A Berfin Simsek
%A Alberto Bietti
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-cabannes24a
%I PMLR
%P 5114--5134
%U https://proceedings.mlr.press/v235/cabannes24a.html
%V 235
%X This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the “classification margins.” Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. We also find that underparameterized regimes lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.
APA
Cabannes, V., Simsek, B. & Bietti, A. (2024). Learning Associative Memories with Gradient Descent. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:5114-5134. Available from https://proceedings.mlr.press/v235/cabannes24a.html.