Video models are zero-shot learners and reasoners
Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
Abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.