EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
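To make the high-level idea of combining acoustic and semantic learning concrete, the sketch below shows one way a semantic (text) loss and an acoustic (speech-unit) loss could be summed during training, with the speech-unit targets standing in for dynamically generated ones. This is a minimal illustration under assumed module names, shapes, and losses, not the authors' EchoX implementation; see the repository above for the actual method.

```python
# Hypothetical sketch: joint semantic + acoustic objective.
# All names, shapes, and losses are illustrative assumptions.
import torch
import torch.nn as nn


class SpeechUnitHead(nn.Module):
    """Toy projection from semantic hidden states to discrete speech units."""

    def __init__(self, hidden_dim: int = 256, num_units: int = 1024):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_units)

    def forward(self, semantic_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, time, hidden) -> (batch, time, num_units)
        return self.proj(semantic_hidden)


def joint_loss(text_logits, text_labels, unit_logits, unit_labels):
    """Sum a text (semantic) loss and a speech-unit (acoustic) loss."""
    ce = nn.CrossEntropyLoss()
    semantic_loss = ce(text_logits.flatten(0, 1), text_labels.flatten())
    acoustic_loss = ce(unit_logits.flatten(0, 1), unit_labels.flatten())
    return semantic_loss + acoustic_loss


# Toy usage with random tensors standing in for model outputs and
# for dynamically generated speech-unit targets.
B, T, H, V, U = 2, 8, 256, 32000, 1024
head = SpeechUnitHead(hidden_dim=H, num_units=U)
semantic_hidden = torch.randn(B, T, H)
text_logits = torch.randn(B, T, V)
text_labels = torch.randint(0, V, (B, T))
unit_labels = torch.randint(0, U, (B, T))  # stand-in for generated targets
loss = joint_loss(text_logits, text_labels, head(semantic_hidden), unit_labels)
loss.backward()
```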