[edit]
Kernel Mean Embeddings of \texttt[CLS] Tokens in ViTs
Proceedings of the 1st Conference on Topology, Algebra, and Geometry in Data Science(TAG-DS 2025), PMLR 321:100-113, 2026.
Abstract
We study the geometry of Vision Transformer (\texttt{ViT}) \texttt{[CLS]} representations across layers through the lens of reproducing kernel Hilbert spaces (RKHS). For each layer and class, we estimate class-conditional kernel mean embeddings (KMEs) and measure separability with maximum mean discrepancy (MMD), tuning the kernel and a per-layer PCA projection via a bootstrap-based signal-to-noise (SNR) objective. We further propose a layer-wise confidence signal by evaluating class mean embeddings along a query’s \texttt{[CLS]} trajectory. On ImageNet-1k subsets, this exploratory, proof-of-concept analysis indicates that the RKHS framework can capture meaningful geometric and semantic signals in \texttt{[CLS]} representations across ViT layers. We make no SOTA claims; our contribution is a unified framework and practical recipe for probing \texttt{[CLS]} geometry.