Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang Hao Yang Yuqi Zhang Yifei Hu Fengda Zhu Chuang Lin Xiaofeng Mei Yi Jiang Zehuan Yuan Bingyue Peng
Abstract
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.