“Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch

Yiqing Xu, Linfeng Li, Cunjun Yu, David Hsu
Proceedings of The 9th Conference on Robot Learning, PMLR 305:124-151, 2025.

Abstract

Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly—they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present *StackItUp*, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. *StackItUp* introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., *left-of*) and stability patterns (e.g., *two-pillar-bridge*) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports—critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, *StackItUp* consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-xu25a,
  title     = {“Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch},
  author    = {Xu, Yiqing and Li, Linfeng and Yu, Cunjun and Hsu, David},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {124--151},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/xu25a/xu25a.pdf},
  url       = {https://proceedings.mlr.press/v305/xu25a.html},
  abstract  = {Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly—they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present *StackItUp*, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. *StackItUp* introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., *left-of*) and stability patterns (e.g., *two-pillar-bridge*) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports—critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, *StackItUp* consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.}
}
Endnote
%0 Conference Paper
%T “Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch
%A Yiqing Xu
%A Linfeng Li
%A Cunjun Yu
%A David Hsu
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-xu25a
%I PMLR
%P 124--151
%U https://proceedings.mlr.press/v305/xu25a.html
%V 305
%X Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly—they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present *StackItUp*, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. *StackItUp* introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., *left-of*) and stability patterns (e.g., *two-pillar-bridge*) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports—critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, *StackItUp* consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.
APA
Xu, Y., Li, L., Yu, C., & Hsu, D. (2025). “Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:124-151. Available from https://proceedings.mlr.press/v305/xu25a.html.
