Highly Compressed Tokenizer Can Generate Without Training

Lukas Lao Beyer, Tianhong Li, Xinlei Chen, Sertac Karaman, Kaiming He
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4096-4114, 2025.

Abstract

Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens: even very crude manipulations, such as copying and replacing tokens between latent representations of images, enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer’s latent space, we construct an image generation pipeline that leverages gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction error or CLIP similarity. We demonstrate this approach on inpainting and text-guided image editing, and show that it generates diverse and realistic samples without training any generative model.
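
As a rough illustration of the two mechanisms described in the abstract (token swapping for editing, and gradient-based test-time optimization of tokens for generation), the sketch below shows how they might look in code. It is a minimal sketch under stated assumptions, not the authors' implementation: the tokenizer/decoder interfaces, the (1, 32, d) token-embedding shape, the CLIP wrapper, and all hyperparameters are hypothetical, and handling of the discrete vector-quantized codes (e.g. a straight-through estimator) is omitted.

# Illustrative sketch only (not the paper's released code). Assumes a hypothetical
# 1D tokenizer whose encoder yields token embeddings of shape (1, 32, d) and whose
# decoder maps tokens back to an image of shape (1, 3, H, W).
import torch
import torch.nn.functional as F


def swap_tokens(tokens_src, tokens_ref, indices):
    """Crude editing heuristic: copy selected 1D tokens from a reference image."""
    edited = tokens_src.clone()
    edited[:, indices] = tokens_ref[:, indices]
    return edited


def optimize_tokens(decoder, loss_fn, init_tokens, steps=200, lr=0.1):
    """Gradient-based test-time optimization of token embeddings with a plug-and-play loss."""
    tokens = init_tokens.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([tokens], lr=lr)
    for _ in range(steps):
        image = decoder(tokens)      # decode the 1D tokens back to pixels
        loss = loss_fn(image)        # any differentiable plug-and-play objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokens.detach()


def inpainting_loss(target, known_mask):
    """Reconstruction loss restricted to the known (unmasked) pixels."""
    return lambda image: F.mse_loss(image * known_mask, target * known_mask)


def clip_guidance_loss(clip_model, text_features):
    """Negative CLIP similarity between the decoded image and a text prompt."""
    def loss_fn(image):
        # CLIP-style encoders expect a fixed input resolution; 224 is assumed here.
        resized = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
        img_feat = F.normalize(clip_model.encode_image(resized), dim=-1)
        return -(img_feat * text_features).sum()
    return loss_fn

Under these assumptions, inpainting would encode the partially observed image and run optimize_tokens with inpainting_loss, while text-guided editing would start from the source image's tokens and use clip_guidance_loss with precomputed, normalized CLIP text features.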

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-beyer25a,
  title     = {Highly Compressed Tokenizer Can Generate Without Training},
  author    = {Beyer, Lukas Lao and Li, Tianhong and Chen, Xinlei and Karaman, Sertac and He, Kaiming},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {4096--4114},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/beyer25a/beyer25a.pdf},
  url       = {https://proceedings.mlr.press/v267/beyer25a.html}
}
Endnote
%0 Conference Paper
%T Highly Compressed Tokenizer Can Generate Without Training
%A Lukas Lao Beyer
%A Tianhong Li
%A Xinlei Chen
%A Sertac Karaman
%A Kaiming He
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-beyer25a
%I PMLR
%P 4096--4114
%U https://proceedings.mlr.press/v267/beyer25a.html
%V 267
APA
Beyer, L.L., Li, T., Chen, X., Karaman, S. & He, K. (2025). Highly Compressed Tokenizer Can Generate Without Training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4096-4114. Available from https://proceedings.mlr.press/v267/beyer25a.html.
