How PARTs assemble into wholes: Learning the relative composition of images

Melika Ayoughi, Samira Abnar, Chen Huang, Christopher Michael Sandino, Sayeri Lala, Eeshan Gunesh Dhekane, Dan Busbridge, Shuangfei Zhai, Vimal Thilak, Joshua M. Susskind, Pascal Mettes, Paul Groth, Hanlin Goh
Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR 307:15-26, 2026.

Abstract

The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid, structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse data types, from images to EEG signals, with potential in medical imaging, video, and audio.
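The abstract describes the pretext task only at a high level. As a minimal sketch of the idea, the snippet below shows what a continuous relative-transformation target between two off-grid patches could look like: patches are sampled at continuous positions and sizes, and the regression target is the translation and scale relating one patch to the other. The function names (sample_offgrid_patch, relative_transform) and the translation-plus-scale parameterization are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def sample_offgrid_patch(img_size, min_size, max_size, rng):
    # Sample a patch at a continuous position and size,
    # not snapped to any fixed grid.
    size = rng.uniform(min_size, max_size)
    top_left = rng.uniform(0, img_size - size, size=2)
    return top_left, size

def relative_transform(patch_a, patch_b):
    # Illustrative continuous target relating patch a to patch b.
    (pos_a, size_a), (pos_b, size_b) = patch_a, patch_b
    translation = (pos_b - pos_a) / size_a  # offset in units of patch a's size
    scale = size_b / size_a                 # relative scale between the patches
    return np.append(translation, scale)    # 3-dim regression target

rng = np.random.default_rng(0)
patch_a = sample_offgrid_patch(224, 16, 64, rng)
patch_b = sample_offgrid_patch(224, 16, 64, rng)
target = relative_transform(patch_a, patch_b)  # target for one patch pair

Because such a target is a continuous quantity rather than a grid index, it can be regressed directly, which is what distinguishes this setup from absolute-position classification as in grid-based methods like DropPos.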

Cite this Paper


BibTeX
@InProceedings{pmlr-v307-ayoughi26a,
  title = {How {PART}s assemble into wholes: Learning the relative composition of images},
  author = {Ayoughi, Melika and Abnar, Samira and Huang, Chen and Sandino, Christopher Michael and Lala, Sayeri and Dhekane, Eeshan Gunesh and Busbridge, Dan and Zhai, Shuangfei and Thilak, Vimal and Susskind, Joshua M. and Mettes, Pascal and Groth, Paul and Goh, Hanlin},
  booktitle = {Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL)},
  pages = {15--26},
  year = {2026},
  editor = {Kim, Hyeongji and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume = {307},
  series = {Proceedings of Machine Learning Research},
  month = {06--08 Jan},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v307/main/assets/ayoughi26a/ayoughi26a.pdf},
  url = {https://proceedings.mlr.press/v307/ayoughi26a.html},
  abstract = {The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid, structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse data types, from images to EEG signals, with potential in medical imaging, video, and audio.}
}
Endnote
%0 Conference Paper
%T How PARTs assemble into wholes: Learning the relative composition of images
%A Melika Ayoughi
%A Samira Abnar
%A Chen Huang
%A Christopher Michael Sandino
%A Sayeri Lala
%A Eeshan Gunesh Dhekane
%A Dan Busbridge
%A Shuangfei Zhai
%A Vimal Thilak
%A Joshua M. Susskind
%A Pascal Mettes
%A Paul Groth
%A Hanlin Goh
%B Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL)
%C Proceedings of Machine Learning Research
%D 2026
%E Hyeongji Kim
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v307-ayoughi26a
%I PMLR
%P 15--26
%U https://proceedings.mlr.press/v307/ayoughi26a.html
%V 307
%X The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid, structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse data types, from images to EEG signals, with potential in medical imaging, video, and audio.
APA
Ayoughi, M., Abnar, S., Huang, C., Sandino, C.M., Lala, S., Dhekane, E.G., Busbridge, D., Zhai, S., Thilak, V., Susskind, J.M., Mettes, P., Groth, P. & Goh, H. (2026). How PARTs assemble into wholes: Learning the relative composition of images. Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 307:15-26. Available from https://proceedings.mlr.press/v307/ayoughi26a.html.
