Steerable Scene Generation with Post Training and Inference-Time Search

Nicholas Ezra Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1690-1702, 2025.

Abstract

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
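To make the inference-time search idea from the abstract concrete, below is a minimal, hypothetical sketch of Monte Carlo tree search over the stochastic reverse-diffusion steps of a scene model. All names and parameters here (denoise_step, scene_reward, the branching factor, the flattened latent scene encoding) are illustrative assumptions for a generic diffusion sampler, not the authors' implementation; in the paper, rewards would be computed on decoded scenes after projection and physics simulation.

import math
import numpy as np

NUM_STEPS = 50    # assumed number of reverse-diffusion steps
BRANCH = 4        # candidate children expanded per node (assumed)
SIM_BUDGET = 200  # total MCTS simulations (assumed)
C_UCT = 1.4       # UCT exploration constant


def denoise_step(x, t, rng):
    # Placeholder for one stochastic reverse-diffusion step of the scene
    # model; x stands in for object logits and SE(3) pose parameters.
    return x - rng.normal(scale=1.0 / NUM_STEPS, size=x.shape)


def scene_reward(x):
    # Placeholder task-specific reward; in practice this would score the
    # decoded scene (e.g. clutter level, feasibility after projection and
    # simulation). Here: a dummy score.
    return -float(np.linalg.norm(x))


class Node:
    def __init__(self, x, t, parent=None):
        self.x, self.t, self.parent = x, t, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + C_UCT * math.sqrt(math.log(self.parent.visits) / self.visits))


def rollout(x, t, rng):
    # Greedy rollout: finish denoising without branching, then score.
    while t > 0:
        x = denoise_step(x, t, rng)
        t -= 1
    return scene_reward(x)


def mcts_search(x_T, rng):
    root = Node(x_T, NUM_STEPS)
    for _ in range(SIM_BUDGET):
        node = root
        # Selection: descend by UCT through fully expanded internal nodes.
        while node.t > 0 and len(node.children) >= BRANCH:
            node = max(node.children, key=Node.uct)
        # Expansion: add one more stochastic denoising branch.
        if node.t > 0:
            child = Node(denoise_step(node.x, node.t, rng), node.t - 1, node)
            node.children.append(child)
            node = child
        # Simulation and backpropagation.
        r = rollout(node.x, node.t, rng)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Follow the most-visited path, then finish denoising greedily.
    best = root
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
    x, t = best.x, best.t
    while t > 0:
        x = denoise_step(x, t, rng)
        t -= 1
    return x


rng = np.random.default_rng(0)
x_T = rng.normal(size=32)  # noisy latent scene encoding (assumed shape)
steered_scene_encoding = mcts_search(x_T, rng)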

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-pfaff25a,
  title     = {Steerable Scene Generation with Post Training and Inference-Time Search},
  author    = {Pfaff, Nicholas Ezra and Dai, Hongkai and Zakharov, Sergey and Iwase, Shun and Tedrake, Russ},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1690--1702},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/pfaff25a/pfaff25a.pdf},
  url       = {https://proceedings.mlr.press/v305/pfaff25a.html},
  abstract  = {Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.}
}
Endnote
%0 Conference Paper
%T Steerable Scene Generation with Post Training and Inference-Time Search
%A Nicholas Ezra Pfaff
%A Hongkai Dai
%A Sergey Zakharov
%A Shun Iwase
%A Russ Tedrake
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-pfaff25a
%I PMLR
%P 1690--1702
%U https://proceedings.mlr.press/v305/pfaff25a.html
%V 305
%X Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
APA
Pfaff, N. E., Dai, H., Zakharov, S., Iwase, S., & Tedrake, R. (2025). Steerable Scene Generation with Post Training and Inference-Time Search. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1690-1702. Available from https://proceedings.mlr.press/v305/pfaff25a.html.
