EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
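To make the high-level idea of combining acoustic and semantic learning concrete, the sketch below shows one way a semantic (text) loss and an acoustic (speech-unit) loss could be summed during training, with the speech-unit targets standing in for dynamically generated ones. This is a minimal illustration under assumed module names, shapes, and losses, not the authors' EchoX implementation; see the repository above for the actual method.

```python
# Hypothetical sketch: joint semantic + acoustic objective.
# All names, shapes, and losses are illustrative assumptions.
import torch
import torch.nn as nn


class SpeechUnitHead(nn.Module):
    """Toy projection from semantic hidden states to discrete speech units."""

    def __init__(self, hidden_dim: int = 256, num_units: int = 1024):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_units)

    def forward(self, semantic_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, time, hidden) -> (batch, time, num_units)
        return self.proj(semantic_hidden)


def joint_loss(text_logits, text_labels, unit_logits, unit_labels):
    """Sum a text (semantic) loss and a speech-unit (acoustic) loss."""
    ce = nn.CrossEntropyLoss()
    semantic_loss = ce(text_logits.flatten(0, 1), text_labels.flatten())
    acoustic_loss = ce(unit_logits.flatten(0, 1), unit_labels.flatten())
    return semantic_loss + acoustic_loss


# Toy usage with random tensors standing in for model outputs and
# for dynamically generated speech-unit targets.
B, T, H, V, U = 2, 8, 256, 32000, 1024
head = SpeechUnitHead(hidden_dim=H, num_units=U)
semantic_hidden = torch.randn(B, T, H)
text_logits = torch.randn(B, T, V)
text_labels = torch.randint(0, V, (B, T))
unit_labels = torch.randint(0, U, (B, T))  # stand-in for generated targets
loss = joint_loss(text_logits, text_labels, head(semantic_hidden), unit_labels)
loss.backward()
```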