LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.