Computer vision has recently excelled in a wide range of tasks, such as image classification, segmentation, and captioning. This impressive progress now powers many internet imaging applications, yet current methods still fail to address the embodied understanding of visual scenes. What will happen if a glass is pushed over the edge of a table? What precise actions are required to plant a tree? Building systems that can answer such questions from visual inputs will empower future robotics and personal visual assistant applications, enabling them to operate in unstructured