Moscow, December 5, 16:00 (GMT +3)
Will be broadcast

Language-guided Visual Navigation and Manipulation

Computer vision has recently excelled at a wide range of tasks, such as image classification, segmentation, and captioning. This impressive progress now powers many internet imaging applications, yet current methods still fall short of an embodied understanding of visual scenes. What will happen if a glass is pushed over the edge of a table? What precise actions are required to plant a tree? Building systems that can answer such questions from visual inputs will empower future robotics and personal visual assistant applications while enabling them to operate in unstructured real-world environments. Following this motivation, this talk will address models and learning methods for visual navigation and manipulation. In particular, we will focus on agents that perform tasks according to natural language instructions. We will discuss representations and models that combine heterogeneous inputs such as language, multi-view observations, and history, and demonstrate state-of-the-art results on several recent benchmarks.
