Kernel Mean Embeddings of \texttt{[CLS]} Tokens in ViTs

Mason Faldet
Proceedings of the 1st Conference on Topology, Algebra, and Geometry in Data Science (TAG-DS 2025), PMLR 321:100-113, 2026.

Abstract

We study the geometry of Vision Transformer (\texttt{ViT}) \texttt{[CLS]} representations across layers through the lens of reproducing kernel Hilbert spaces (RKHS). For each layer and class, we estimate class-conditional kernel mean embeddings (KMEs) and measure separability with maximum mean discrepancy (MMD), tuning the kernel and a per-layer PCA projection via a bootstrap-based signal-to-noise (SNR) objective. We further propose a layer-wise confidence signal by evaluating class mean embeddings along a query’s \texttt{[CLS]} trajectory. On ImageNet-1k subsets, this exploratory, proof-of-concept analysis indicates that the RKHS framework can capture meaningful geometric and semantic signals in \texttt{[CLS]} representations across ViT layers. We make no SOTA claims; our contribution is a unified framework and practical recipe for probing \texttt{[CLS]} geometry.
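The separability measure at the heart of the abstract is the maximum mean discrepancy between two class-conditional samples, i.e. the RKHS distance between their kernel mean embeddings. A minimal NumPy sketch of the standard unbiased MMD² U-statistic with an RBF kernel is below; the synthetic Gaussian features stand in for per-layer \texttt{[CLS]} vectors, and the fixed `gamma` is an arbitrary illustration, not the paper's SNR-tuned kernel or PCA projection:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma):
    """Unbiased estimate of MMD^2 between samples X and Y,
    equal to the squared RKHS distance between their mean embeddings."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    # Drop diagonal (self-pair) terms to get the unbiased U-statistic.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
A  = rng.normal(0.0, 1.0, size=(200, 16))  # stand-in for class-A [CLS] features
A2 = rng.normal(0.0, 1.0, size=(200, 16))  # second draw from the same class
B  = rng.normal(0.5, 1.0, size=(200, 16))  # stand-in for a shifted class B

print("across classes:", mmd2_unbiased(A, B, gamma=0.1))
print("within class:  ", mmd2_unbiased(A, A2, gamma=0.1))
```

The across-class estimate comes out clearly larger than the within-class one, which hovers near zero (the unbiased estimator can even be slightly negative); the paper's bootstrap SNR objective can be read as tuning the kernel and projection so this gap is large relative to its resampling variability.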

Cite this Paper


BibTeX
@InProceedings{pmlr-v321-faldet26a,
  title     = {Kernel Mean Embeddings of \texttt{[CLS]} Tokens in ViTs},
  author    = {Faldet, Mason},
  booktitle = {Proceedings of the 1st Conference on Topology, Algebra, and Geometry in Data Science (TAG-DS 2025)},
  pages     = {100--113},
  year      = {2026},
  editor    = {Bernardez Gil, Guillermo and Black, Mitchell and Cloninger, Alexander and Doster, Timothy and Emerson, Tegan and García-Redondo, Inés and Holtz, Chester and Kotak, Mit and Kvinge, Henry and Mishne, Gal and Papillon, Mathilde and Pouplin, Alison and Rainey, Katie and Rieck, Bastian and Telyatnikov, Lev and Yeats, Eric and Wang, Qingsong and Wang, Yusu and Wayland, Jeremy},
  volume    = {321},
  series    = {Proceedings of Machine Learning Research},
  month     = {01--02 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v321/main/assets/faldet26a/faldet26a.pdf},
  url       = {https://proceedings.mlr.press/v321/faldet26a.html},
  abstract  = {We study the geometry of Vision Transformer (\texttt{ViT}) \texttt{[CLS]} representations across layers through the lens of reproducing kernel Hilbert spaces (RKHS). For each layer and class, we estimate class-conditional kernel mean embeddings (KMEs) and measure separability with maximum mean discrepancy (MMD), tuning the kernel and a per-layer PCA projection via a bootstrap-based signal-to-noise (SNR) objective. We further propose a layer-wise confidence signal by evaluating class mean embeddings along a query’s \texttt{[CLS]} trajectory. On ImageNet-1k subsets, this exploratory, proof-of-concept analysis indicates that the RKHS framework can capture meaningful geometric and semantic signals in \texttt{[CLS]} representations across ViT layers. We make no SOTA claims; our contribution is a unified framework and practical recipe for probing \texttt{[CLS]} geometry.}
}
Endnote
%0 Conference Paper
%T Kernel Mean Embeddings of \texttt{[CLS]} Tokens in ViTs
%A Mason Faldet
%B Proceedings of the 1st Conference on Topology, Algebra, and Geometry in Data Science (TAG-DS 2025)
%C Proceedings of Machine Learning Research
%D 2026
%E Guillermo Bernardez Gil
%E Mitchell Black
%E Alexander Cloninger
%E Timothy Doster
%E Tegan Emerson
%E Inés García-Redondo
%E Chester Holtz
%E Mit Kotak
%E Henry Kvinge
%E Gal Mishne
%E Mathilde Papillon
%E Alison Pouplin
%E Katie Rainey
%E Bastian Rieck
%E Lev Telyatnikov
%E Eric Yeats
%E Qingsong Wang
%E Yusu Wang
%E Jeremy Wayland
%F pmlr-v321-faldet26a
%I PMLR
%P 100--113
%U https://proceedings.mlr.press/v321/faldet26a.html
%V 321
%X We study the geometry of Vision Transformer (\texttt{ViT}) \texttt{[CLS]} representations across layers through the lens of reproducing kernel Hilbert spaces (RKHS). For each layer and class, we estimate class-conditional kernel mean embeddings (KMEs) and measure separability with maximum mean discrepancy (MMD), tuning the kernel and a per-layer PCA projection via a bootstrap-based signal-to-noise (SNR) objective. We further propose a layer-wise confidence signal by evaluating class mean embeddings along a query’s \texttt{[CLS]} trajectory. On ImageNet-1k subsets, this exploratory, proof-of-concept analysis indicates that the RKHS framework can capture meaningful geometric and semantic signals in \texttt{[CLS]} representations across ViT layers. We make no SOTA claims; our contribution is a unified framework and practical recipe for probing \texttt{[CLS]} geometry.
APA
Faldet, M. (2026). Kernel Mean Embeddings of \texttt{[CLS]} Tokens in ViTs. Proceedings of the 1st Conference on Topology, Algebra, and Geometry in Data Science (TAG-DS 2025), in Proceedings of Machine Learning Research 321:100-113. Available from https://proceedings.mlr.press/v321/faldet26a.html.