Break and Make: Interactive Structural Understanding Using LEGO Bricks

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13688)

Abstract

Visual understanding of geometric structures with complex spatial relationships is a fundamental component of human intelligence. As children, we learn how to reason about structure not only from observation, but also by interacting with the world around us – by taking things apart and putting them back together again. The ability to reason about structure and compositionality allows us not only to build things, but also to understand and reverse-engineer complex systems. To advance research in interactive reasoning for part-based geometric understanding, we propose a challenging new assembly problem using LEGO bricks that we call Break and Make. In this problem, an agent is given a LEGO model and attempts to understand its structure by interactively inspecting and disassembling it. After this inspection period, the agent must then prove its understanding by rebuilding the model from scratch using low-level action primitives. To facilitate research on this problem, we have built LTRON, a fully interactive 3D simulator that allows learning agents to assemble, disassemble and manipulate LEGO models. We pair this simulator with a new dataset of fan-made LEGO creations uploaded to the internet, providing complex scenes that contain over a thousand unique brick shapes. We take a first step towards solving this problem using sequence-to-sequence models that provide guidance on how to make progress on this challenging task. Our simulator and data are available at github.com/aaronwalsman/ltron. Additional training code and PyTorch examples are available at github.com/aaronwalsman/ltron-torch-eccv22.
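
Since LTRON exposes Break and Make as an interactive environment, a short sketch may help make the observe/act loop concrete. This is a minimal illustration only: the import path, the environment id, the classic gym step signature, and the random-action placeholder are assumptions made for readability, not the simulator's documented API; the repositories linked above contain the real interface.

    # Hypothetical sketch of one Break-and-Make episode through a
    # gym-style wrapper. The names below are assumptions; see
    # github.com/aaronwalsman/ltron for the actual API.
    import gym
    import ltron.gym.envs  # assumed import that registers LTRON envs

    env = gym.make('LTRON-BreakAndMake-v0')  # hypothetical environment id
    observation = env.reset()

    done = False
    while not done:
        # Break phase: inspect the model and remove bricks one at a time.
        # Make phase: rebuild it from scratch with low-level primitives
        # (choose a brick shape, place it, rotate it, and so on).
        action = env.action_space.sample()  # stand-in for a learned policy
        observation, reward, done, info = env.step(action)

    env.close()

Whatever the exact action format, a real policy has to carry what it observed during disassembly into the rebuilding phase, since the original model must be reproduced from scratch after inspection ends; this memory requirement is a large part of what makes the task challenging.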


Author information

Corresponding author

Correspondence to Aaron Walsman.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (MP4, 588 KB)

Supplementary material 2 (PDF, 3369 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Walsman, A., Zhang, M., Kotar, K., Desingh, K., Farhadi, A., Fox, D. (2022). Break and Make: Interactive Structural Understanding Using LEGO Bricks. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_6

  • DOI: https://doi.org/10.1007/978-3-031-19815-1_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19814-4

  • Online ISBN: 978-3-031-19815-1

  • eBook Packages: Computer Science, Computer Science (R0)
