Steerable Scene Generation with Post Training and Inference-Time Search

Nicholas Ezra Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1690-1702, 2025.

Abstract

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
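To make the inference-time search idea from the abstract concrete, below is a minimal, hypothetical sketch of Monte Carlo tree search over the stochastic reverse-diffusion steps of a scene model. All names and parameters here (denoise_step, scene_reward, the branching factor, the flattened latent scene encoding) are illustrative assumptions for a generic diffusion sampler, not the authors' implementation; in the paper, rewards would be computed on decoded scenes after projection and physics simulation.

import math
import numpy as np

NUM_STEPS = 50    # assumed number of reverse-diffusion steps
BRANCH = 4        # candidate children expanded per node (assumed)
SIM_BUDGET = 200  # total MCTS simulations (assumed)
C_UCT = 1.4       # UCT exploration constant


def denoise_step(x, t, rng):
    # Placeholder for one stochastic reverse-diffusion step of the scene
    # model; x stands in for object logits and SE(3) pose parameters.
    return x - rng.normal(scale=1.0 / NUM_STEPS, size=x.shape)


def scene_reward(x):
    # Placeholder task-specific reward; in practice this would score the
    # decoded scene (e.g. clutter level, feasibility after projection and
    # simulation). Here: a dummy score.
    return -float(np.linalg.norm(x))


class Node:
    def __init__(self, x, t, parent=None):
        self.x, self.t, self.parent = x, t, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + C_UCT * math.sqrt(math.log(self.parent.visits) / self.visits))


def rollout(x, t, rng):
    # Greedy rollout: finish denoising without branching, then score.
    while t > 0:
        x = denoise_step(x, t, rng)
        t -= 1
    return scene_reward(x)


def mcts_search(x_T, rng):
    root = Node(x_T, NUM_STEPS)
    for _ in range(SIM_BUDGET):
        node = root
        # Selection: descend by UCT through fully expanded internal nodes.
        while node.t > 0 and len(node.children) >= BRANCH:
            node = max(node.children, key=Node.uct)
        # Expansion: add one more stochastic denoising branch.
        if node.t > 0:
            child = Node(denoise_step(node.x, node.t, rng), node.t - 1, node)
            node.children.append(child)
            node = child
        # Simulation and backpropagation.
        r = rollout(node.x, node.t, rng)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Follow the most-visited path, then finish denoising greedily.
    best = root
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
    x, t = best.x, best.t
    while t > 0:
        x = denoise_step(x, t, rng)
        t -= 1
    return x


rng = np.random.default_rng(0)
x_T = rng.normal(size=32)  # noisy latent scene encoding (assumed shape)
steered_scene_encoding = mcts_search(x_T, rng)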

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-pfaff25a,
  title     = {Steerable Scene Generation with Post Training and Inference-Time Search},
  author    = {Pfaff, Nicholas Ezra and Dai, Hongkai and Zakharov, Sergey and Iwase, Shun and Tedrake, Russ},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1690--1702},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/pfaff25a/pfaff25a.pdf},
  url       = {https://proceedings.mlr.press/v305/pfaff25a.html},
  abstract  = {Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.}
}
Endnote
%0 Conference Paper
%T Steerable Scene Generation with Post Training and Inference-Time Search
%A Nicholas Ezra Pfaff
%A Hongkai Dai
%A Sergey Zakharov
%A Shun Iwase
%A Russ Tedrake
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-pfaff25a
%I PMLR
%P 1690--1702
%U https://proceedings.mlr.press/v305/pfaff25a.html
%V 305
%X Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
APA
Pfaff, N. E., Dai, H., Zakharov, S., Iwase, S., & Tedrake, R. (2025). Steerable Scene Generation with Post Training and Inference-Time Search. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1690-1702. Available from https://proceedings.mlr.press/v305/pfaff25a.html.
