On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:6197-6309, 2026.
Abstract
Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of work has begun to analyze this paradigm, existing bounds leave open the question of how sharp current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage $M$-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the \emph{intrinsic} parameters of the pre-training representation, which we link with the downstream predictor through a notion of \emph{orbit-invariance}, precisely characterizing the limiting distribution of the downstream test risk. We apply our results to spectral pre-training, factor models, and Gaussian mixture models, obtaining substantial improvements in problem-specific factors over prior art when applicable.