Key Points
1. Large language models (LLMs) like ChatGPT have become powerful task solvers through conversational interactions, but most only support text-based interactions, limiting their applications.
2. The emergence of GPT-4o has made it possible to interact with LLMs through speech and receive responses with extremely low latency, significantly enhancing the user experience.
3. The simplest way to enable speech interaction with LLMs is a cascaded system that chains automatic speech recognition (ASR), the LLM, and text-to-speech (TTS) synthesis, but passing through intermediate text tends to result in higher latency.
4. Some multimodal speech-language models can generate speech responses directly from speech instructions without intermediate text, achieving low latency, but direct speech-to-speech generation can be challenging.
5. LLaMA-Omni is a novel model architecture that integrates a speech encoder, a speech adaptor, an LLM, and a streaming speech decoder to enable low-latency and high-quality speech interaction (a minimal sketch of this pipeline follows this list).
6. LLaMA-Omni eliminates the need for speech transcription and can simultaneously generate text and speech responses directly from speech instructions.
7. To align the model with speech interaction scenarios, a dataset named InstructS2S-200K is constructed, containing 200K speech instructions and corresponding speech responses.
8. Experimental results show that LLaMA-Omni provides better responses than previous speech-language models in both content and style, with a response latency as low as 226 ms.
9. Training LLaMA-Omni takes less than 3 days on just 4 GPUs, enabling efficient development of speech-language models based on the latest LLMs.
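To make points 5 and 6 more concrete, here is a minimal PyTorch sketch of how speech input can be fed to the LLM in an architecture of this kind: a speech encoder (e.g. a Whisper-style encoder) produces frame-level features, and a small adaptor downsamples them and projects them into the LLM's embedding space, where they are prepended to the prompt embeddings. The class name, layer sizes, and the 5x downsampling factor below are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a speech adaptor feeding an LLM; all names and sizes below
# are illustrative assumptions, not the exact LLaMA-Omni configuration.
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Downsamples speech-encoder features and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
        super().__init__()
        self.k = k  # concatenate every k consecutive frames (assumed downsampling scheme)
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                      # feats: (batch, frames, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.k                         # drop trailing frames so t is divisible by k
        grouped = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(grouped)                  # (batch, frames // k, llm_dim)

# Usage: the resulting "speech embeddings" stand in for text tokens and are
# prepended to the prompt-template embeddings before the LLM decodes its response.
adaptor = SpeechAdaptor()
speech_feats = torch.randn(1, 100, 1280)           # e.g. ~2 s of audio at 50 frames/s (assumed)
speech_embeds = adaptor(speech_feats)
print(speech_embeds.shape)                         # torch.Size([1, 20, 4096])
```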
Summary
This paper introduces LLaMA-Omni, a novel model architecture designed to enable real-time speech interaction with large language models (LLMs). Its key features are:
1. Integrated architecture: LLaMA-Omni consists of a speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. This integrated design allows the model to process speech input directly and generate both text and speech responses, eliminating the need for intermediate speech transcription.
2. Simultaneous text and speech generation: During inference, as the LLM autoregressively generates the text response, the speech decoder simultaneously generates the corresponding discrete speech units. This enables extremely low-latency responses, with speech output beginning as little as 226 ms after the speech input is received (a sketch of this decoding loop follows below).
3. Instruction-aligned dataset: To better align the model with speech interaction scenarios, the authors construct InstructS2S-200K, a dataset of 200K speech instructions and corresponding speech responses, which is used to train LLaMA-Omni.
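The simultaneous generation in point 2 can be pictured as a loop in which the LLM emits text tokens one at a time while its hidden states are periodically handed to the speech decoder, which outputs discrete units that a vocoder turns into audio. The sketch below uses stub functions to simulate the LLM, speech decoder, and vocoder; the chunk size and the exact interfaces are assumptions about the general scheme, not the paper's precise procedure.

```python
# Self-contained sketch of streaming text + speech-unit generation.
# llm_step, speech_decoder, and vocoder are stubs that only simulate behaviour;
# the chunk size and interfaces are assumptions, not the paper's exact design.
import random

EOS = "<eos>"

def llm_step(state):
    """Stub autoregressive LLM step: returns (token, hidden_state, new_state)."""
    token = EOS if state >= 25 else f"tok{state}"
    return token, [0.0] * 8, state + 1

def speech_decoder(hidden_states):
    """Stub streaming speech decoder: maps LLM hidden states to discrete unit ids."""
    return [random.randrange(1000) for _ in hidden_states]

def vocoder(units):
    """Stub unit-to-waveform vocoder: returns a dummy waveform chunk."""
    return [0.0] * (len(units) * 320)

def streaming_respond(chunk_size=10):
    """Yield (partial_text, audio_chunk) pairs while the text response is still being generated."""
    state, text, pending = 0, [], []
    while True:
        token, hidden, state = llm_step(state)
        done = token == EOS
        if not done:
            text.append(token)
            pending.append(hidden)
        if pending and (len(pending) == chunk_size or done):
            units = speech_decoder(pending)          # speech units for this text chunk
            yield " ".join(text), vocoder(units)     # stream partial text + audio
            pending = []
        if done:
            return

for partial_text, audio_chunk in streaming_respond():
    print(len(partial_text.split()), "tokens so far,", len(audio_chunk), "audio samples in chunk")
```

Because audio is produced chunk by chunk rather than only after the full text response is finished, playback can begin as soon as the first chunk is ready, which is what keeps the perceived latency in the hundreds of milliseconds.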
The experimental results show that compared to previous speech-language models, LLaMA-Omni provides superior responses in both content and style, while maintaining the low-latency advantage. Additionally, the training of LLaMA-Omni takes less than 3 days on just 4 GPUs, demonstrating its efficiency and potential for rapid development of speech interaction models based on the latest LLMs.
The key innovation of this work is the integrated model architecture that enables the simultaneous generation of text and speech responses from speech input, achieving low-latency performance without compromising response quality. By leveraging the latest LLM capabilities and constructing a dataset tailored for speech interaction, the authors have made an important contribution towards enabling seamless speech-based interactions with large language models.
Reference: https://arxiv.org/abs/2409.066...