Date

6 months ago

Size

1.62 GB

1. Tutorial Introduction

OmniGen2 is an open-source multimodal generative model released by the Beijing Academy of Artificial Intelligence (BAAI) on June 16, 2025. It aims to provide a unified solution for several generative tasks, including text-to-image generation, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 uses two independent decoding paths for the text and image modalities, with non-shared parameters and a decoupled image tokenizer. This design lets OmniGen2 build on existing multimodal understanding models without re-adapting them to VAE inputs, so the original text generation capabilities are retained. Its core innovations are the dual-path architecture and a self-reflection mechanism, and it reports competitive results among open-source multimodal models. The related paper is "OmniGen2: Exploration to Advanced Multimodal Generation".

This tutorial runs on a single RTX A6000 GPU. English prompts currently give better results.

2. Effect display

Some example outputs from OmniGen2:

OmniGen2 image editing demonstration

OmniGen2 in-context generation demonstration

3. Operation steps

1. Start the container

2. Usage steps

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 2-3 minutes and then refresh the page.
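The wait-and-refresh step can also be automated. The following is a minimal sketch, not part of the tutorial itself: it polls the page until it stops answering with an error such as 502 Bad Gateway. The function names, the polling interval, and the example URL in the note below are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the model is still loading
    except urllib.error.URLError:
        return None  # server not reachable yet


def wait_until_ready(url, probe=http_status, interval=15, max_wait=300,
                     sleep=time.sleep):
    """Poll `url` until it answers 200; give up after `max_wait` seconds."""
    waited = 0
    while True:
        if probe(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

For example, `wait_until_ready("http://localhost:7860")` would block until the page loads or five minutes pass. The address here is hypothetical; use the one shown for your own container.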

The first example is image description, the second and third examples are text-to-image generation, and the remaining examples are image editing.

Specific parameters:

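Height: Height of the generated image, in pixels.
Width: Width of the generated image, in pixels.
Text Guidance Scale: Classifier-free guidance (CFG) strength for the text prompt; higher values follow the prompt more strictly.
Image Guidance Scale: Guidance strength for the input image(s); higher values keep the output closer to the reference image, lower values leave more room for the text prompt.
CFG Range Start: Start of the portion of the denoising process in which CFG is applied.
CFG Range End: End of that portion; lowering it can shorten inference time.
Scheduler: Sampling scheduler used for denoising.
Inference Steps: Number of denoising steps; more steps take longer.
Number of images per prompt: Number of images generated for each prompt.
Seed: Random seed; reuse the same value to reproduce a result.
max_input_image_side_length: Maximum side length for input images; larger inputs are scaled down.
max_pixels: Maximum total pixel count (width × height) for input images; larger images are resized while keeping their aspect ratio.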
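To make the size limits and the CFG range more concrete, here is a small illustrative sketch. It is not taken from the OmniGen2 code: the function names, the default limits, and the rounding to a multiple of 16 are assumptions chosen for the example. It shows one way an input image size could be reduced to respect both limits, and how a guidance scale could be switched on only within a fraction of the denoising steps.

```python
import math


def fit_to_limits(width, height, max_side_length=2048,
                  max_pixels=1024 * 1024, multiple=16):
    """Shrink (width, height) to respect both limits, keeping the aspect ratio."""
    # Scale factor required by each constraint; never upscale (cap at 1.0).
    side_ratio = max_side_length / max(width, height)
    pixel_ratio = math.sqrt(max_pixels / (width * height))
    ratio = min(1.0, side_ratio, pixel_ratio)
    # Round down to a multiple of `multiple` (an assumed value for this sketch).
    new_w = max(multiple, int(width * ratio) // multiple * multiple)
    new_h = max(multiple, int(height * ratio) // multiple * multiple)
    return new_w, new_h


def guidance_scale_at(step, num_steps, scale,
                      cfg_range_start=0.0, cfg_range_end=1.0):
    """Return `scale` while the step lies inside the CFG range, else 1.0.

    A scale of 1.0 means guidance is effectively switched off for that step.
    """
    progress = step / num_steps
    return scale if cfg_range_start <= progress <= cfg_range_end else 1.0
```

Under these assumptions, a 4096 × 2048 input is reduced to fit within `max_pixels`, while a 512 × 512 input is left unchanged. With `cfg_range_end=0.6` and 50 steps, guidance is only active for roughly the first 60% of the steps, which is why a lower end value can reduce inference time.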

result

4. Discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group. You are welcome to scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results.

Citation Information

The citation information for this project is as follows:

@article{wu2025omnigen2,
  title={OmniGen2: Exploration to Advanced Multimodal Generation},
  author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
  journal={arXiv preprint arXiv:2506.18871},
  year={2025}
}
