Position: Understanding LLMs Requires More Than Statistical Generalization

Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42365-42390, 2024.

Abstract

The last decade has seen blossoming research in deep learning theory attempting to answer, “Why does deep learning generalize?” A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart—thus, equivalent test loss—can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
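
The core claim of the abstract can be made concrete with a small toy example. The Python sketch below is not code from the paper: the a^n b^n language, both toy sequence models, and every name in it are illustrative assumptions. It constructs two distributions that agree exactly on every training sequence drawn from {a^n b^n : 1 <= n <= 5}, so their test losses coincide and held-out likelihood cannot tell them apart, yet they assign different probability to the unseen string a^6 b^6, mirroring case study (1) on zero-shot rule extrapolation.

import math

TRAIN_MAX_N = 5
DATA = ["a" * n + "b" * n for n in range(1, TRAIN_MAX_N + 1)]  # uniform training/test distribution

def p_rule(s: str) -> float:
    """Assigns mass 2^{-n} to a^n b^n for every n >= 1, i.e. it extrapolates the rule."""
    n = len(s) // 2
    return 0.5 ** n if n >= 1 and s == "a" * n + "b" * n else 0.0

def p_memo(s: str) -> float:
    """Matches p_rule on the training support; the leftover 2^{-5} mass goes to the junk string 'c'."""
    n = len(s) // 2
    if 1 <= n <= TRAIN_MAX_N and s == "a" * n + "b" * n:
        return 0.5 ** n
    return 0.5 ** TRAIN_MAX_N if s == "c" else 0.0

def test_loss(p) -> float:
    """Expected negative log-likelihood (nats per sequence) under the uniform data distribution."""
    return -sum(math.log(p(s)) for s in DATA) / len(DATA)

# Identical probabilities on every training sequence, hence identical test loss (about 2.079 nats) ...
print(test_loss(p_rule), test_loss(p_memo))
# ... yet markedly different behavior on the unseen string a^6 b^6: 0.015625 vs. 0.0.
print(p_rule("a" * 6 + "b" * 6), p_memo("a" * 6 + "b" * 6))

Because the two toy models are statistically indistinguishable on held-out data, any preference between them has to come from inductive biases or other non-statistical considerations, which is exactly the perspective shift the paper argues for.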

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-reizinger24a,
  title     = {Position: Understanding {LLM}s Requires More Than Statistical Generalization},
  author    = {Reizinger, Patrik and Ujv\'{a}ry, Szilvia and M\'{e}sz\'{a}ros, Anna and Kerekes, Anna and Brendel, Wieland and Husz\'{a}r, Ferenc},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {42365--42390},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/reizinger24a/reizinger24a.pdf},
  url       = {https://proceedings.mlr.press/v235/reizinger24a.html},
  abstract  = {The last decade has seen blossoming research in deep learning theory attempting to answer, ``Why does deep learning generalize?'' A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart---thus, equivalent test loss---can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.}
}
Endnote
%0 Conference Paper
%T Position: Understanding LLMs Requires More Than Statistical Generalization
%A Patrik Reizinger
%A Szilvia Ujváry
%A Anna Mészáros
%A Anna Kerekes
%A Wieland Brendel
%A Ferenc Huszár
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-reizinger24a
%I PMLR
%P 42365--42390
%U https://proceedings.mlr.press/v235/reizinger24a.html
%V 235
%X The last decade has seen blossoming research in deep learning theory attempting to answer, “Why does deep learning generalize?” A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart—thus, equivalent test loss—can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
APA
Reizinger, P., Ujváry, S., Mészáros, A., Kerekes, A., Brendel, W. & Huszár, F. (2024). Position: Understanding LLMs Requires More Than Statistical Generalization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:42365-42390. Available from https://proceedings.mlr.press/v235/reizinger24a.html.
