Abstract
Visual understanding of geometric structures with complex spatial relationships is a fundamental component of human intelligence. As children, we learn to reason about structure not only from observation, but also by interacting with the world around us: by taking things apart and putting them back together again. The ability to reason about structure and compositionality allows us not only to build things, but also to understand and reverse-engineer complex systems. To advance research in interactive reasoning for part-based geometric understanding, we propose a challenging new assembly problem using LEGO bricks that we call Break and Make. In this problem, an agent is given a LEGO model and attempts to understand its structure by interactively inspecting and disassembling it. After this inspection period, the agent must prove its understanding by rebuilding the model from scratch using low-level action primitives. To facilitate research on this problem, we have built LTRON, a fully interactive 3D simulator that allows learning agents to assemble, disassemble, and manipulate LEGO models. We pair this simulator with a new dataset of fan-made LEGO creations uploaded to the internet, providing complex scenes that contain over a thousand unique brick shapes. We take a first step towards solving this problem using sequence-to-sequence models, which offer guidance on how to make further progress on this challenging task. Our simulator and data are available at github.com/aaronwalsman/ltron. Additional training code and PyTorch examples are available at github.com/aaronwalsman/ltron-torch-eccv22.
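To make the interaction loop concrete, the sketch below shows what a Break and Make episode could look like through a Gym-style interface: an inspection/disassembly phase followed by a rebuild phase driven by low-level actions. The module path, environment id, and observation/action handling here are illustrative assumptions rather than the exact LTRON API; consult github.com/aaronwalsman/ltron for the real interface.

```python
# Hypothetical Break and Make episode loop (old-style Gym API).
# Assumptions: the module path `ltron.gym.envs` and the environment id
# 'LTRON-BreakAndMake-v0' are illustrative, not confirmed LTRON names.
import gym
import ltron.gym.envs  # assumed: importing registers LTRON environments

env = gym.make('LTRON-BreakAndMake-v0')  # assumed environment id
obs = env.reset()

done = False
while not done:
    # Break phase: inspect the model (camera moves) and disassemble it
    # brick by brick to expose hidden structure. Make phase: rebuild the
    # model from scratch using low-level primitives. A trained policy
    # (e.g., the paper's sequence-to-sequence baseline) would map
    # obs -> action here; a random sample stands in as a placeholder.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)

env.close()
```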
Electronic Supplementary Material
Supplementary material 1 (MP4, 588 KB)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Walsman, A., Zhang, M., Kotar, K., Desingh, K., Farhadi, A., Fox, D. (2022). Break and Make: Interactive Structural Understanding Using LEGO Bricks. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_6
DOI: https://doi.org/10.1007/978-3-031-19815-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19814-4
Online ISBN: 978-3-031-19815-1