Mini-Omni is a multimodal large language model with end-to-end, real-time speech input and output. Unlike traditional pipelines that rely on a separate text-to-speech (TTS) system, Mini-Omni can process speech input and generate speech output directly, eliminating the delay between text generation and speech synthesis.
Designed for voice conversations, Mini-Omni's key feature is support for "thinking while talking": the model reasons while it generates voice output, streaming audio as it goes and reducing the latency of speech generation.
It is the first open-source multimodal model capable of real-time conversation: it can understand speech, generate speech, and maintain real-time responses during interaction.
Mini-Omni implements a "think while talking" capability: the model continues to think and process information while generating text or audio. Traditional models usually complete all computation or reasoning first and then output the complete result (text or voice) at once. A "think while talking" model instead keeps reasoning while it generates, emitting content incrementally rather than waiting until thinking is finished.
The key advantage of this capability is real-time generation and processing, which makes conversation more fluid and natural. In a dialogue, for example, the model can start producing a partial answer as soon as a user asks a question and refine it as it processes more complex content, with no long wait for computation to complete. This is particularly suited to applications that require real-time interaction, such as voice assistants, chatbots, and intelligent customer service systems.
What problem does Mini-Omni solve?
- Real-time voice interaction delay problem: Traditional models usually rely on a two-step process of generating text and then converting it into voice when generating voice, which causes significant delays and affects user experience. Mini-Omni uses parallel generation technology to generate text and voice at the same time, greatly reducing response time and achieving true real-time voice interaction.
- Integration of speech and text reasoning capabilities: Most existing large language models perform well in text reasoning, but are relatively weak in speech reasoning. Mini-Omni retains the strong capabilities of language models in text reasoning through innovative training methods and model architectures, and extends these capabilities to speech processing and generation.
- Reduce model complexity and resource requirements: Mini-Omni simplifies the process of integrating speech capabilities into large language models through the "Any Model Can Talk" approach. This approach requires less additional training data and model adjustments, allowing other models to quickly have speech interaction capabilities, reducing resource and time consumption.
Main Features of Mini-Omni
Real-time voice input and output
Mini-Omni can process voice input and generate voice output simultaneously, achieving true end-to-end voice interaction. This means that users can have a voice conversation with the model, and the model can respond immediately without the delay steps of text generation and speech conversion.
Thinking while speaking: the model can reason while generating speech, reducing latency and improving conversational fluency. It can begin outputting speech before it has fully computed the complete answer.
Supports continuous streaming voice output, suited to interactive scenarios that require real-time feedback, such as voice assistants and intelligent customer service (a minimal streaming sketch follows).
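As an illustration of this streaming behavior, the sketch below models speech output as a Python generator that yields audio chunks as soon as each piece is synthesized. The `synthesize_step` function is a hypothetical placeholder, not Mini-Omni's real API.

```python
# Minimal sketch of streaming voice output: chunks are yielded as soon as
# they are ready, so playback can begin before the full answer exists.
# synthesize_step is a hypothetical stand-in, not Mini-Omni's real API.
import time
from typing import Iterator

def synthesize_step(word: str) -> bytes:
    """Pretend to turn one word into a chunk of audio bytes."""
    time.sleep(0.01)  # stands in for per-chunk synthesis time
    return word.encode()

def stream_reply(words: list[str]) -> Iterator[bytes]:
    for word in words:
        yield synthesize_step(word)  # caller can play each chunk immediately

for chunk in stream_reply("hello how can I help".split()):
    print(f"play {len(chunk)} bytes")
```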
Speech Recognition and Generation
Mini-Omni is equipped with automatic speech recognition (ASR) function, which can convert the user's voice input into text for processing.
At the same time, it can also generate speech, turning text or reasoning results directly into spoken responses for the user.
Multimodal Understanding and Generation
Mini-Omni supports not only voice but also other input modalities such as text, and it can convert between them, for example generating text from voice or voice from text.
Parallel generation technology
Through parallel generation technology, Mini-Omni can generate text and voice responses simultaneously, greatly reducing the delay problem of voice output and ensuring efficient real-time conversation capabilities.
“Any Model Can Talk” approach
This feature enables existing large language models to quickly have voice input and output capabilities. With minimal data and architecture adjustments, Mini-Omni provides a simple solution for integrating voice capabilities into other models, helping them achieve voice interaction functions.
Batch Parallel Inference
To further improve the performance of the model in speech reasoning tasks, Mini-Omni adopts a batch parallel reasoning method, which can maintain the complexity and accuracy of text reasoning while generating speech.
VoiceAssistant-400K dataset support
Mini-Omni uses a VoiceAssistant-400K dataset designed specifically for voice assistants to optimize the model's performance in voice assistant scenarios. This dataset is used to train the model's voice question-answering and conversation capabilities, making it more adaptable in voice assistant applications.
Technical Approach of Mini-Omni
End-to-end speech generation architecture
- Mini-Omni adopts an end-to-end voice input and output architecture, going directly from voice input to voice output. This avoids the traditional speech-to-text and text-to-speech steps, greatly reducing latency and providing real-time voice conversation capabilities.
Thinking while talking
- Concept: during a conversation, Mini-Omni is able to generate audio while reasoning. This "thinking while talking" behavior is achieved through delayed parallel generation.
- Working mechanism: at each decoding step, the model uses a delay technique, generating a text token first; the multi-layer codebooks of the SNAC audio codec then begin generating audio tokens in parallel, each layer offset by a delay (see the sketch after this list). This allows the model to produce high-quality audio with only a short delay.
- Technical advantages: delayed parallel generation lets the model manage the complexity of generating text and audio simultaneously while still ensuring high-quality audio output.
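A minimal sketch of such a delay schedule is below. The one-step-per-layer offset is an assumption for illustration, not Mini-Omni's exact configuration; the layer count follows the 8-layer SNAC codebook described later in this article.

```python
# Sketch of a delayed parallel decoding schedule: stream 0 is text, streams
# 1..N are SNAC codebook layers, and layer k starts k steps after the text.
# The one-step-per-layer offset is an illustrative assumption.
NUM_AUDIO_LAYERS = 8  # per the 8-layer SNAC codebook described below

def active_streams(step: int) -> list[str]:
    """Label which streams emit a real token at a given decoding step;
    streams that have not started yet emit a padding token instead."""
    labels = ["text"] + [f"snac{k}" for k in range(1, NUM_AUDIO_LAYERS + 1)]
    return [lab if step >= k else "<pad>" for k, lab in enumerate(labels)]

for t in range(4):
    print(f"step {t}: {active_streams(t)}")
# Step 0 emits only text; each audio layer joins one step later, so the first
# complete audio frame arrives after a handful of steps rather than after the
# whole text answer has been generated.
```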
Parallel generation technology
- Mini-Omni uses a parallel generation strategy to generate text and voice responses simultaneously, reducing the time it takes to generate speech and ensuring that users get feedback almost instantly. Parallel generation can also flexibly handle tasks in different modalities.
- Core idea: Mini-Omni introduces a text-guided parallel generation strategy, which generates text while generating audio, and uses text reasoning capabilities to improve the accuracy of audio generation.
- Implementation: the model assumes that text has a higher information density, so at each step it generates the corresponding text token first and then the audio tokens, achieving simultaneous output of text and audio. This shortens the wait for audio generation and keeps speech and text output in sync; a single decoding step is sketched below.
- Technical advantages: the parallel generation strategy avoids the delay of the traditional generate-text-then-generate-audio pipeline, greatly speeding up speech generation and improving real-time performance.
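To make the idea concrete, here is a minimal, hypothetical sketch of one decoding step in which a single hidden state feeds one text head and several audio heads. The vocabulary sizes, hidden size, and greedy decoding are assumptions, not the model's actual configuration.

```python
# One text-guided parallel decoding step: a single forward pass yields logits
# for the text head and every audio head, and one token is taken per stream.
# All sizes and the greedy sampling here are illustrative assumptions.
import torch

TEXT_VOCAB, AUDIO_VOCAB, HIDDEN, N_AUDIO_HEADS = 32000, 4096, 768, 8

text_head = torch.nn.Linear(HIDDEN, TEXT_VOCAB)
audio_heads = torch.nn.ModuleList(
    torch.nn.Linear(HIDDEN, AUDIO_VOCAB) for _ in range(N_AUDIO_HEADS)
)

def decode_step(hidden: torch.Tensor) -> tuple[int, list[int]]:
    """Sample one text token and one token per audio codebook layer from the
    same hidden state, so text and audio advance together each step."""
    text_token = int(torch.argmax(text_head(hidden), dim=-1))
    audio_tokens = [int(torch.argmax(h(hidden), dim=-1)) for h in audio_heads]
    return text_token, audio_tokens

hidden = torch.randn(HIDDEN)  # stand-in for the transformer's output state
print(decode_step(hidden))
```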
Batch Parallel Inference
- Core idea: Batch parallel generation technology is used to improve the efficiency and accuracy of the model when processing audio and text reasoning tasks.
- Implementation: inputs are processed in batches, and each query must yield both text and audio. During inference, two samples are generated in parallel for the same query: one produces only text, the other produces audio. The text-only sample's output is then embedded into the audio sample's text stream, so audio generation is conditioned on the stronger text reasoning (see the sketch after this list).
- Technical advantages: This method effectively utilizes the model's powerful capabilities in text reasoning and transfers them to audio generation, greatly improving the model's performance in processing audio reasoning tasks with lower requirements for computing resources.
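The sketch below mirrors this two-sample batching logic with a stub model. `StubModel` and its `decode` method are hypothetical stand-ins, not Mini-Omni's real interface; only the embed-the-text-output step follows the description above.

```python
# Runnable sketch of batch parallel inference: the same query appears twice in
# one batch, once text-only and once text+audio, and the text sample's output
# is embedded into the audio sample. StubModel.decode is a hypothetical API.
class StubModel:
    def decode(self, batch):
        # Pretend the text-only sample produced stronger reasoning.
        return [
            {"text_stream": "a careful answer", "audio_stream": None},
            {"text_stream": "a weaker answer", "audio_stream": [101, 102, 103]},
        ]

def batch_parallel_step(model, query):
    batch = [
        {"input": query, "emit": "text"},        # text-only: best reasoning
        {"input": query, "emit": "text+audio"},  # text+audio: streamed speech
    ]
    out_text, out_audio = model.decode(batch)
    # Condition the audio sample on the text-only sample's stronger output.
    out_audio["text_stream"] = out_text["text_stream"]
    return out_audio

print(batch_parallel_step(StubModel(), "What is the capital of France?"))
```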
SNAC Audio Codec
- Core technology: Mini-Omni uses the SNAC audio encoder, which is an efficient music-grade encoder with an 8-layer codebook structure that can process a large number of audio tokens in a short time.
- How it works: The SNAC encoder efficiently encodes audio and discretizes the audio signal into multiple levels of codebooks. This encoding method greatly reduces the complexity of the model when processing audio while ensuring that the generated audio has high fidelity.
- Technical advantages: through its multi-layer structure, the SNAC codec lets the model generate high-quality audio efficiently, avoiding the quality degradation commonly caused by low-bitrate codecs (a usage sketch follows this list).
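As a usage sketch, the open-source `snac` package (`pip install snac`) exposes an encode/decode round trip like the one below. The checkpoint name and the codebook layout it ships with are assumptions here and may differ from Mini-Omni's setup.

```python
# Hedged sketch of a SNAC encode/decode round trip using the open-source
# `snac` package; the checkpoint and its codebook layout may differ from
# what Mini-Omni actually uses.
import torch
from snac import SNAC

model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
audio = torch.randn(1, 1, 24000)  # (batch, channels, samples): 1 s at 24 kHz

with torch.inference_mode():
    codes = model.encode(audio)        # list of token tensors, one per level
    reconstructed = model.decode(codes)

for i, c in enumerate(codes):
    print(f"codebook level {i}: {c.shape[-1]} tokens")
# Coarse levels carry few tokens per second and fine levels add detail, which
# is what lets a language model stream coarse audio structure first.
```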
“Any Model Can Talk” Approach
- Concept: This is an innovative training and inference method designed to help other large language models quickly adapt to speech output capabilities.
- Implementation: This method is divided into three stages:
- Modality alignment: First, align the model’s text and audio to ensure that the model can understand and generate speech. In this phase, the model is initially trained using speech recognition and speech synthesis data to improve its speech processing capabilities.
- Adaptive training: Once the audio and text modalities are aligned, the model begins to focus on generating text given the audio input, and the audio output is achieved through simple text-to-audio synthesis. This stage uses data from speech question answering (Speech QA) and text question answering (Text QA) for training.
- Multimodal fine-tuning: In the final stage, all weights of the model are unfrozen and comprehensive fine-tuning is performed using multimodal data to ensure that the model remains efficient in multimodal interactions.
- Technical advantages: this method greatly reduces training cost, allowing other language models to gain voice interaction capabilities quickly with a small amount of additional data and without major changes to the model architecture (a stage-configuration sketch follows this list).
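Below is a minimal sketch of this staged schedule, assuming a PyTorch model with hypothetical `llm`, `audio_adapter`, and `audio_head` submodules. The exact split of what trains in stage 2 is an assumption for illustration; only the pattern (adapters first, full unfreeze last) follows the description above.

```python
# Illustrative three-stage freeze/unfreeze schedule. The submodule names and
# the stage-2 split are assumptions, not Mini-Omni's actual training code.
import torch.nn as nn

class TinyOmni(nn.Module):
    """Hypothetical stand-in with the submodules the sketch refers to."""
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)            # stands in for the LLM core
        self.audio_adapter = nn.Linear(8, 8)  # audio-input adapter
        self.audio_head = nn.Linear(8, 8)     # audio-token output head

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model: TinyOmni, stage: int) -> None:
    if stage == 1:    # modality alignment: train only the audio modules
        set_trainable(model, False)
        set_trainable(model.audio_adapter, True)
        set_trainable(model.audio_head, True)
    elif stage == 2:  # adaptation training: train the core on QA data
        set_trainable(model, False)
        set_trainable(model.llm, True)
    else:             # multimodal fine-tuning: unfreeze everything
        set_trainable(model, True)

model = TinyOmni()
configure_stage(model, 1)
print(sum(p.requires_grad for p in model.parameters()), "trainable tensors")
```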
Text-instructed delayed parallel generation
- With this strategy, the model generates text first and then speech conditioned on that text, using the efficiency of text reasoning to reduce the complexity of speech generation while maintaining high-quality speech output.
Audio Discretization and Coding
- Mini-Omni uses audio discretization technology to convert speech signals into discrete audio tokens for inference processing in language models. The SNAC encoder is used to ensure high-quality speech generation.
- Audio encoding: Mini-Omni uses the Whisper speech encoder to turn audio input into representations the model can process (a hedged encoding sketch follows this list).
- Audio decoding: When generating audio, the model decodes audio tokens through a multi-layer codebook technique to ensure that the generated audio is of high quality and low latency.
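For the encoding step, here is a hedged sketch using the openai-whisper package (`pip install openai-whisper`); it only illustrates the encoder pass, and Mini-Omni's actual input pipeline may differ.

```python
# Hedged sketch of encoding speech with the openai-whisper package; this
# illustrates the encoder step only, not Mini-Omni's actual input pipeline.
import torch
import whisper

model = whisper.load_model("small")
audio = torch.zeros(16000 * 5).numpy()  # stand-in: 5 s of silence at 16 kHz
audio = whisper.pad_or_trim(audio)      # Whisper works on 30 s windows
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.inference_mode():
    features = model.embed_audio(mel.unsqueeze(0))  # (1, frames, hidden)
print(features.shape)
```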
Three-stage training framework
- The training process of Mini-Omni is divided into three stages:
- Modality alignment phase: training the model's speech recognition and generation capabilities.
- Adaptation training phase: Use speech recognition and text generation task data to further optimize the model's speech understanding and text generation capabilities.
- Comprehensive fine-tuning stage: Perform multimodal fine-tuning on the model, optimize the voice output quality, and achieve flexible switching between voice and text.
VoiceAssistant-400K Dataset of Mini-Omni
Overview: In order to optimize the model's voice output capabilities, the Mini-Omni team created a dedicated dataset VoiceAssistant-400K. This dataset is synthesized by GPT-4o and contains 400,000 entries specifically for training voice assistants, ensuring that the model can generate natural and fluent voice output when generating voice assistant-style conversations.
1. Data sources and generation methods
- Generation method: the VoiceAssistant-400K dataset is generated by the GPT-4o model, which produced more than 400,000 supervised fine-tuning (SFT) examples for voice assistant training.
- Data content: The dataset includes voice question-and-answer conversations in a variety of voice assistant scenarios. Each entry includes not only the question and answer in text form, but also the corresponding audio content, ensuring that the model can perform effective reasoning and generation in the scenarios of voice input and output.
- Purpose: mainly used to train Mini-Omni's voice assistant function, helping the model avoid emitting code symbols and overly long text when generating speech, and ensuring that the spoken output is natural and fluent (a hypothetical entry is sketched below).
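For illustration, a single entry could look like the hypothetical record below; the field names are invented for this sketch and are not the dataset's actual schema.

```python
# Hypothetical shape of one VoiceAssistant-400K entry; field names are
# invented for illustration, not the dataset's actual schema.
example_entry = {
    "question_text": "What's the weather like in Paris today?",
    "question_audio": "question_000001.wav",  # spoken form of the question
    "answer_text": "It's sunny and around 22 degrees in Paris right now.",
    "answer_audio": "answer_000001.wav",      # synthesized spoken answer
}
```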
2. Size of the dataset
- Data volume: The VoiceAssistant-400K dataset contains more than 400,000 voice conversation data. This data volume is sufficient to cover various common voice assistant scenarios, ensuring that the model can perform efficient voice interaction in a variety of scenarios after training.
- Multimodal data: The dataset not only covers the correspondence between text and speech, but also includes multimodal input and output, ensuring that the model can provide corresponding speech output when faced with text or speech input.
3. Application scenarios
- Voice assistant optimization: the dataset is designed for fine-tuning voice assistants, training natural and coherent voice dialogue models that can handle user questions and instructions and generate spoken feedback.
- Other applications: In addition to voice assistants, this dataset can also be applied to other scenarios that require speech generation and understanding, such as intelligent customer service systems, real-time speech translation systems, etc.
4. Technical features
- Supervised fine-tuning (SFT): The VoiceAssistant-400K dataset is specifically designed for supervised fine-tuning to ensure that the model can effectively learn voice assistant-style conversational patterns. Through supervised learning, the model can not only improve speech comprehension, but also further strengthen reasoning and response capabilities through question-answer pairs in the data.
- Avoid generating redundant information: During the generation process, the dataset specifically optimizes the model to not include code symbols or overly long text when generating voice output, ensuring that the conversation is concise, accurate, and close to the real voice assistant experience.
5. Contribution of the dataset
- Accelerate voice assistant model training: With this dataset, the Mini-Omni model can learn the skills required by voice assistants faster and more accurately, thereby reducing model training time and improving the naturalness of voice interaction.
- Improving the practicality of multimodal models: VoiceAssistant-400K not only provides strong data support for voice assistants, but also provides effective training data for multimodal models (including text, audio input and output), making them perform better in multimodal tasks.
Experimental Results of Mini-Omni
The experimental results of Mini-Omni mainly demonstrate the performance of the model in multimodal tasks, especially in core tasks such as speech recognition, speech generation, and voice question answering. The following is a detailed introduction to the experimental results:
1. Automatic Speech Recognition (ASR) Results
Mini-Omni was evaluated on multiple speech recognition benchmarks to measure its ability to understand audio input. The experimental results are as follows:
- Test set: the LibriSpeech dataset, divided into four parts: test-clean, test-other, dev-clean, and dev-other.
- Evaluation metric: Word Error Rate (WER).
| Method | test-clean | test-other | dev-clean | dev-other |
| --- | --- | --- | --- | --- |
| wav2vec2-base | 6.0 | 13.4 | – | – |
| VITA | 8.14 | 18.41 | 7.57 | 16.57 |
| whisper-small | 3.4 | 7.6 | – | – |
| Mini-Omni | 4.5 | 9.7 | 4.6 | 9.2 |
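For reference, WER values like those in the table can be computed with a standard library such as jiwer; this is a generic sketch, not the paper's evaluation pipeline.

```python
# Generic WER computation with jiwer (pip install jiwer); not the paper's
# actual evaluation code.
import jiwer

reference = "mini omni answers in real time"
hypothesis = "mini on me answers in real time"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```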
Conclusion: although Mini-Omni's speech recognition accuracy is slightly below Whisper-small's, it compares well with the other methods, especially on the test-clean and dev-clean sets, where its results approach Whisper-small's, indicating strong speech understanding.
2. Speech QA and Text QA Results
One of the main innovations of Mini-Omni is its performance in multimodal tasks, especially speech question answering (Speech QA) and text question answering (Text QA). Here is the performance of the model in these two tasks:
- Task Type:
- Text QA: The model generates text answers based on text input.
- Speech QA: The model generates spoken responses based on speech input, using a parallel generation strategy to achieve real-time response.
- Performance evaluation: Mini-Omni demonstrated efficient reasoning capabilities when processing Text QA and Speech QA tasks, especially when using batch parallel generation technology, the reasoning performance of speech output was significantly improved.
3. Effect of batch parallel generation strategy
The batch parallel decoding strategy introduced by Mini-Omni improves the model's reasoning efficiency by generating text and audio simultaneously. Experimental results show that this strategy brings improvements in the following aspects:
- Improved reasoning capabilities: Batch parallel generation extends the model's reasoning capabilities from text reasoning to speech generation, significantly improving the model's performance in voice question-answering tasks.
- Improved audio quality: Through parallel generation technology, the model can generate higher quality audio, especially reducing latency in streaming output and improving user experience.
4. Speech Generation Quality Assessment
The speech generation quality of Mini-Omni is comparable to that of traditional text-to-speech (TTS) systems. The experiments used the following quality criteria:
- Audio Clarity: With the SNAC audio encoder, the generated audio quality is on par with common TTS systems.
- Latency test: although the Gradio demo can add some network-related delay, the generated audio is smooth and of high quality overall.
5. Performance Summary
Mini-Omni has demonstrated strong speech and text processing capabilities through testing on multiple tasks, especially in multimodal dialogue tasks, where its thinking-while-generating feature ensures the natural fluency of real-time interactions.
Summary
- Speech recognition capability: Mini-Omni's performance in speech recognition is close to that of the mainstream Whisper-small model, indicating that it has strong speech understanding capabilities.
- Speech generation capability: Through batch parallel generation and SNAC encoder, Mini-Omni can efficiently generate high-quality speech and significantly reduce the generation delay.
- Reasoning performance: The batch parallel generation strategy significantly improves the reasoning efficiency of the model, especially its performance in multimodal tasks, enabling it to maintain consistent and efficient reasoning capabilities in both voice question answering and text question answering.
Main Contributions of Mini-Omni
- End-to-end speech generation: Mini-Omni achieves real-time interaction between speech and text through a parallel generation strategy, reducing generation delays and making speech interaction more natural and smooth.
- “Any Model Can Talk” approach: Provides a path for other language models to quickly expand into the field of voice interaction with only a small amount of data and minimal model modifications.
- High-quality speech generation and multimodal reasoning: Mini-Omni not only performs well in speech recognition (ASR) and speech generation (TTS) tasks, but also has strong reasoning capabilities in multimodal tasks (such as TextQA and SpeechQA).
- Author: KCGOD
- URL: https://kcgod.com/Mini-Omni
- Copyright: all articles on this blog, unless otherwise stated, are published under a BY-NC-SA license. Please credit the source.