From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning

Noa Rubin, Kirsten Fischer, Javed Lindner, Inbar Seroussi, Zohar Ringel, Michael Krämer, Moritz Helias
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52225-52257, 2025.

Abstract

Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these two views. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network’s probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. However, for linear and non-linear networks, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
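As a rough illustration of the distinction the abstract draws, consider the standard Gaussian-process posterior-mean predictor (written here in generic GP-regression notation; this is an assumption for illustration, not the paper's derivation). A uniform rescaling of the kernel by a factor $s$ is absorbed into an effective ridge term, whereas a directional change to the kernel is not:

\[
\mu(x_*) = k_*(x_*)^\top \left(K + \sigma^2 I\right)^{-1} y,
\qquad
K \to sK,\; k_* \to s\,k_* \;\Longrightarrow\;
\mu(x_*) = k_*(x_*)^\top \left(K + \tfrac{\sigma^2}{s} I\right)^{-1} y .
\]

A directional adaptation $K \to K + \Delta$ cannot, in general, be reduced to such a scalar $s$, which is the sense in which kernel adaptation carries information beyond a rescaling of the kernel.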

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-rubin25a,
  title     = {From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning},
  author    = {Rubin, Noa and Fischer, Kirsten and Lindner, Javed and Seroussi, Inbar and Ringel, Zohar and Kr\"{a}mer, Michael and Helias, Moritz},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52225--52257},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/rubin25a/rubin25a.pdf},
  url       = {https://proceedings.mlr.press/v267/rubin25a.html},
  abstract  = {Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these two views. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network’s probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. However, for linear and non-linear networks, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.}
}
Endnote
%0 Conference Paper
%T From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning
%A Noa Rubin
%A Kirsten Fischer
%A Javed Lindner
%A Inbar Seroussi
%A Zohar Ringel
%A Michael Krämer
%A Moritz Helias
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-rubin25a
%I PMLR
%P 52225--52257
%U https://proceedings.mlr.press/v267/rubin25a.html
%V 267
%X Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these two views. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network’s probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. However, for linear and non-linear networks, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
APA
Rubin, N., Fischer, K., Lindner, J., Seroussi, I., Ringel, Z., Krämer, M. & Helias, M. (2025). From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52225-52257. Available from https://proceedings.mlr.press/v267/rubin25a.html.
