MiMo-Audio-7B-Instruct: Xiaomi's Open Source end-to-end Voice Model

1. Tutorial Introduction

MiMo-Audio is an end-to-end speech model released by Xiaomi in September 2025. Its pre-training data was scaled to over 100 million hours, and at that scale the researchers observed few-shot learning capabilities across a variety of audio tasks. A systematic evaluation found that MiMo-Audio-7B-Base achieves state-of-the-art (SOTA) performance among open-source models on both speech intelligence and audio understanding benchmarks. Beyond standard metrics, the model generalizes to tasks not covered in its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also has strong speech continuation capabilities and can generate highly realistic talk show, recitation, live streaming, and debate content.

In the post-training phase, the researchers compiled a diverse instruction fine-tuning corpus and introduced thinking mechanisms into both audio understanding and generation. The resulting MiMo-Audio-7B-Instruct achieves SOTA performance among open-source models on audio understanding, spoken dialogue, and instruction-TTS benchmarks, and in some scenarios approaches or surpasses closed-source models. The related research is published as the MiMo-Audio-Technical-Report.

This tutorial runs on a single RTX 5090 GPU.

2. Demo Examples

1. 🔊 Audio Understanding

2. 🎵 Audio Generation (Text-to-Speech)

3. 🎤 Spoken Dialogue

4. 💬 Speech-to-Text (S2T) Dialogue

5. 📝 Text-to-Text Dialogue

3. Operation Steps

1. Start the container

2. Initialize the model weights

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 2-3 minutes and then refresh the page.

In Safari, audio may not play directly in the page. If that happens, download the audio file and play it locally.

3. Audio Understanding

4. Audio Generation

5. Voice Conversation

6. Voice-to-Text Conversation

7. Text-to-Text Conversation
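
The "Bad Gateway" wait described in step 2 can also be automated instead of refreshing by hand. The following is a minimal sketch, not part of the original tutorial: the helper names `http_status` and `wait_until_ready`, the 15-second polling interval, and the 5-minute cap are all assumptions chosen for illustration, and the sketch treats any status other than HTTP 200 (including 502 Bad Gateway) as "still initializing".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to `url` (0 if unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code (e.g. 502) is what we need.
        return err.code
    except urllib.error.URLError:
        # Connection refused, DNS failure, etc.: treat as "not ready yet".
        return 0


def wait_until_ready(
    fetch_status: Callable[[], int],
    max_wait_s: float = 300.0,   # assumed cap, a bit above the 2-3 minutes in the note
    interval_s: float = 15.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the page answers with HTTP 200; give up after `max_wait_s`."""
    waited = 0.0
    while True:
        if fetch_status() == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example usage (replace the placeholder with the address shown for your container):
# ready = wait_until_ready(lambda: http_status("<your-container-address>"))
```

Passing the status check in as a callable keeps the waiting logic independent of the network, so it can be tried out with a stub before pointing it at a real container address.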

Citation Information

@misc{coreteam2025mimoaudio,
      title={MiMo-Audio: Audio Language Models are Few-Shot Learners}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      url={https://github.com/XiaomiMiMo/MiMo-Audio}, 
}
