Moshi is a multi-stream real-time speech-to-speech generation Transformer model that supports full-duplex voice conversations. Its main features are simultaneous speech input and output (full-duplex), and the ability to handle complex conversation scenarios, including overlapping speech, interruptions, and non-verbal information such as emotional expressions.
This means it can listen and speak simultaneously, aiming to address some of the issues in traditional dialogue systems such as latency, loss of non-verbal information (e.g., emotions), and the rigid structure of conversational turns.
- Full-duplex communication: Traditional conversational systems are turn-based (one person finishes speaking before the other starts). Moshi breaks away from this limitation and supports full-duplex communication. This means that Moshi can generate voice responses while the user is speaking, without turn constraints, and can handle complex conversational dynamics such as overlapping speech, interruptions, and rapid feedback.
- Multi-stream processing: Moshi enables simultaneous listening and speech generation by processing multiple audio streams. This multi-stream architecture enables it to flexibly handle voice interactions between users and the system without interrupting the natural flow of the conversation.
Compared to traditional voice dialogue systems, Moshi has several significant advantages:
- Real-time response: Moshi responds very quickly, with a latency of only 160-200 milliseconds, close to the pace of natural conversation, so it provides a smoother conversational experience.
- Speech-to-speech processing: Traditional systems typically rely on a speech-to-text-to-speech process, while Moshi can directly process speech input and generate speech output, preserving non-verbal information such as tone and emotion.
- Full-duplex conversation: Moshi does not rely on strict conversation turns, but can process both user and system speech simultaneously, which means it can handle overlapping speech and interruptions, and is closer to the natural form of human conversation.
Key Features of Moshi
Real-time speech-to-speech conversations
Moshi generates audio output directly from audio input, rather than relying on the traditional speech-to-text-to-speech process. By processing speech data directly, Moshi preserves non-verbal information such as tone, emotion, overlapping speech, and interruptions, ensuring that conversations are more natural and fluid.
Full-duplex communication
Moshi is able to listen and speak simultaneously, meaning it can generate voice responses as the user speaks, without strict conversational turn-taking. It can handle complex conversational scenarios, such as overlapping speech and non-interruptive feedback (such as "hmm" or "I understand") that can be inserted at any time.
Low latency
Moshi is designed for very low latency: theoretically 160 milliseconds, and around 200 milliseconds in practice. This means Moshi can respond to user input in close to real time, providing a smoother conversation experience.
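The 160 ms figure follows directly from the codec's numbers. Below is a back-of-the-envelope check in Python, assuming (as described later for Mimi) a 12.5 Hz frame rate and one frame of delay between semantic and acoustic tokens; the constants are those assumptions, not measured values.

```python
# Rough check of Moshi's theoretical latency, assuming a 12.5 Hz codec
# frame rate and a 1-frame semantic-to-acoustic delay (both assumptions
# taken from the description of Mimi later in this article).
FRAME_RATE_HZ = 12.5              # token frames per second
FRAME_MS = 1000 / FRAME_RATE_HZ   # -> 80 ms of audio per frame
ACOUSTIC_DELAY_FRAMES = 1         # acoustic tokens lag the semantic token

theoretical_latency_ms = FRAME_MS * (1 + ACOUSTIC_DELAY_FRAMES)
print(theoretical_latency_ms)     # 160.0 -> the 160 ms figure quoted above
```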
Inner Monologue Method
Moshi predicts text tokens before generating speech, which significantly improves the language quality and consistency of the generated speech. This not only makes the generated speech clearer, but also gives the system streaming speech recognition and text-to-speech capabilities. By introducing this "inner monologue" mechanism, Moshi implements streaming ASR and TTS, processing language and audio simultaneously in a continuous conversation flow.
Processing multiple audio streams in parallel
Moshi is able to process both the user and the system's voice streams simultaneously. This multi-stream processing capability allows Moshi to not only generate its own speech, but also understand and respond to the user's speech in real time.
Emotion and speech dynamics processing
By processing speech directly rather than through intermediate text, Moshi is able to understand and generate emotionally charged speech and handle complex conversational dynamics such as emotional expression and vocal inflection.
Support for complex conversation dynamics
Moshi is able to handle the complex dynamics of natural conversations, such as interruptions, overlapping speech, interjections, and backchannel responses. Traditional systems rely on strict conversational turns (one person finishes before the other speaks), but Moshi removes this limitation, making conversations more natural.
Model Architecture of Moshi
Moshi consists of three main parts: Helium, a 7B language model trained on 2.1 trillion tokens; Mimi, a neural audio codec that models both semantic and acoustic information; and a new multi-stream architecture that models the user's and Moshi's audio streams separately.
By working together, these modules enable smooth full-duplex conversations, emotional expression, and the handling of complex conversational dynamics.
Helium Text Language Model
- Helium is the core of Moshi. It is a text language model with 7 billion parameters based on the Transformer architecture (similar to GPT). Helium provides Moshi with powerful language understanding and generation capabilities, capable of handling complex text reasoning and dialogue tasks.
- Its training data comprises 2.1 trillion tokens of English text, giving it extensive knowledge and language capabilities.
Mimi Neural Audio Codec
- Mimi is Moshi’s audio processing component. It is a neural audio codec responsible for converting audio into discrete speech tokens, and for decoding those tokens back into high-quality speech output.
- Mimi uses Residual Vector Quantization (RVQ) to encode speech into discrete acoustic and semantic tokens, ensuring high speech fidelity and language consistency.
- By combining semantic and acoustic tokens, Mimi can not only generate natural speech but also handle complex speech context and emotional information.
Inner Monologue Method
- The inner monologue method is a key technology in Moshi's speech generation: the model predicts time-aligned text tokens before generating speech. This not only improves the language quality of the generated speech, but also allows Moshi to perform speech recognition and text-to-speech conversion in a streaming environment.
- Synchronous generation of text and speech: Before generating audio, Moshi generates a text stream corresponding to its speech output. This text stream serves as the basis for speech generation, making the speech generation more accurate and helping to handle complex conversation scenarios.
- Streaming compatibility: This approach allows Moshi to process speech while still enabling efficient speech recognition and text-to-speech (TTS) in a streaming environment.
The model architecture is designed to process multiple parallel audio streams and generate speech and text in real time. Moshi can generate system speech while processing user speech, which enables it to support uninterrupted natural conversations.
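To make that data flow concrete, here is a minimal sketch of the per-frame loop in Python. Every function body is a toy placeholder (the real Mimi and Helium components are described above), and the token counts are assumptions; only the control flow reflects the architecture: encode the user's frame, step the joint model, decode Moshi's frame, every 80 ms.

```python
import numpy as np

SAMPLE_RATE = 24_000                     # Mimi operates on 24 kHz audio
FRAME_SAMPLES = int(SAMPLE_RATE / 12.5)  # 1920 samples = one 80 ms frame

def mimi_encode(frame):
    """Placeholder: the real Mimi maps a frame to a few discrete tokens."""
    return np.random.randint(0, 2048, size=8)        # 8 RVQ levels (assumed)

def backbone_step(user_tokens):
    """Placeholder: the real backbone attends to both streams + text prefix."""
    text_token = 0                                    # next inner-monologue token
    moshi_tokens = np.random.randint(0, 2048, size=8) # Moshi's own audio tokens
    return text_token, moshi_tokens

def mimi_decode(tokens):
    """Placeholder: the real Mimi decoder reconstructs an 80 ms waveform."""
    return np.zeros(FRAME_SAMPLES)

# Full-duplex loop: every 80 ms Moshi both ingests the user's frame and emits
# its own frame -- there is no turn-taking gate anywhere in the loop.
for step in range(5):
    user_frame = np.zeros(FRAME_SAMPLES)             # would come from the microphone
    text_token, moshi_tokens = backbone_step(mimi_encode(user_frame))
    out_frame = mimi_decode(moshi_tokens)            # would go to the speaker
```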
Detailed Technical Methods of Moshi
1. Speech-to-speech generation architecture
- Moshi's core innovation is to treat voice conversation as a speech-to-speech generation task, rather than the traditional multi-component pipeline of speech-to-text followed by text-to-speech. Traditional voice conversation systems chain multiple independent modules: voice activity detection (VAD), speech recognition (ASR), natural language understanding (NLU), natural language generation (NLG), and text-to-speech (TTS).
- Moshi directly generates speech tokens so that speech does not rely on intermediate text representations during understanding and generation, thus avoiding the loss of information (such as emotion, tone, and non-verbal sounds).
2. Helium Text Language Model
Moshi is built on the Helium text language model, a large text generation model with 7B parameters. Helium is pre-trained on 2.1 trillion tokens of English text and has strong language understanding, reasoning, and generation capabilities. It is the semantic foundation of Moshi and supports complex natural language processing functions, including open-ended conversation and question answering.
Key Features of Helium:
- Autoregressive Transformer architecture: like the classic decoder-only Transformer, Helium uses multi-layer attention and autoregressive modeling to process text input and generate output. Its 7B parameters are enough to support learning from a large-scale corpus.
- RMS normalization: RMS normalization is used in the attention modules, feedforward modules, and output layer to improve training stability (see the sketch after this list).
- Rotary Positional Embedding (RoPE): used to handle longer context windows (4,096 tokens), ensuring the model can capture long-range dependencies in a conversation.
- FlashAttention: optimized attention computation makes inference over long sequence inputs more efficient.
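As an illustration of the RMS normalization item above, here is a minimal generic implementation in PyTorch. This is a sketch of the technique, not Helium's actual code, and the hidden size is only an example:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features; unlike LayerNorm
        # there is no mean subtraction and no bias, which is cheaper.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 16, 4096)   # (batch, sequence length, model dimension)
print(RMSNorm(4096)(x).shape)  # torch.Size([2, 16, 4096])
```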
3. Mimi Neural Audio Codec
Mimi is the neural audio codec Moshi uses for speech processing. Its task is to discretize continuous speech signals into audio tokens. These discrete audio tokens are analogous to text tokens and can represent detailed information in speech. Mimi uses residual vector quantization (RVQ) to retain high-quality audio at a low bit rate, supporting real-time speech generation and processing.
Key technologies of Mimi:
- Residual Vector Quantization (RVQ): Mimi uses multi-level residual vector quantization to discretize complex audio signals into several levels of audio tokens (a minimal sketch follows this list). This approach lets each time step efficiently encode the semantic and acoustic information of speech while preserving audio reconstruction quality.
- Combination of semantic and acoustic tokens: The audio tokens used by Mimi include both semantic and acoustic information. Semantic tokens retain the content of the speech (such as the specific words spoken), while acoustic tokens describe the audio characteristics of the speech, such as timbre, emotion, and intonation.
- Streaming encoding and decoding: Mimi supports streaming, which enables continuous speech generation and recognition in real-time conversations. This makes Moshi's response speed very close to natural conversation.
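The residual quantization idea is easy to see in code. Below is a minimal RVQ sketch in plain NumPy; the vector dimension, codebook size, and number of levels are illustrative, not Mimi's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, LEVELS = 16, 256, 4          # illustrative sizes only
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))

def rvq_encode(x):
    """Quantize x level by level: each codebook encodes the previous residual."""
    residual, indices = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(((cb - residual) ** 2).sum(axis=-1))  # nearest code
        indices.append(int(idx))
        residual -= cb[idx]          # the next level only sees what is left
    return indices

def rvq_decode(indices):
    """Reconstruction is just the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=DIM)
tokens = rvq_encode(x)                            # LEVELS discrete tokens
print(np.linalg.norm(x - rvq_decode(tokens)))     # error shrinks as LEVELS grows
```

Each extra level spends a few more bits to shrink the remaining error, which is how an RVQ codec trades bit rate for fidelity.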
4. Architecture of RQ-Transformer
Moshi uses a multi-stream hierarchical generation architecture that can process multiple audio streams in parallel. Moshi achieves flexible interaction in the conversation by simultaneously modeling the user's voice stream and the system's own voice stream, allowing complex conversation dynamics such as interleaving, interruptions, and interjections between speakers.
This architecture was previously proposed for discrete image generation, and it makes it possible to model a hierarchy of semantic and acoustic tokens without increasing the length of the Helium sequence. Each second of audio only needs to pass through the 7B backbone 12.5 times, which can run in real time on an L4 GPU or an M3 MacBook Pro (see the arithmetic sketch after the list below). Combined with the token delay pattern from MusicGen, this provides state-of-the-art performance for audio language modeling.
- Hierarchical autoregressive modeling: Moshi uses RQ-Transformer (Residual Quantizer Transformer) to decompose audio tokens into multiple levels and generate audio through hierarchical autoregressive modeling. Specifically, the model first uses a larger Temporal Transformer to process the time series, and then uses a smaller Depth Transformer to process multiple subsequences at each time step. This design greatly improves the efficiency of generating long audio sequences.
- Multimodal sequence generation: The model generates multiple sequences (including text, semantic tokens, and audio tokens) simultaneously, and ensures that they are precisely aligned in time through the inner monologue mechanism. The content generated at each time step contains not only the current speech, but also the corresponding text prefix, making the generated speech content more semantically logical.
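The efficiency claim above (12.5 backbone passes per second of audio) can be checked with simple arithmetic. The per-frame token count below is an assumption for illustration (two speakers with 8 RVQ levels each, plus one text stream):

```python
# Tokens per second of audio: hierarchical vs. flattened generation.
FRAMES_PER_SEC = 12.5            # Mimi's frame rate
TOKENS_PER_FRAME = 2 * 8 + 1     # 2 speakers x 8 RVQ levels + 1 text (assumed)

flattened = FRAMES_PER_SEC * TOKENS_PER_FRAME  # 212.5 passes/s through the 7B model
hierarchical = FRAMES_PER_SEC                  # 12.5 passes/s: one per frame
print(flattened, hierarchical)
# The remaining within-frame tokens are produced by the much smaller
# Depth Transformer, which is what keeps generation real-time.
```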
5. “Inner Monologue” mechanism
Moshi's "Inner Monologue" mechanism is one of the key innovations in its speech generation. Through this mechanism, Moshi predicts the corresponding time-aligned text tokens before generating audio . This not only improves the language consistency of the generated speech, but also supports real-time speech recognition (ASR) and text-to-speech (TTS) conversion.
Features of the “inner monologue” mechanism:
- Aligned text and audio generation: Moshi first predicts the text and then generates the audio, making the generated speech more accurate and fluent in grammar and content.
- Delay mechanism: by introducing a delay between the text and audio streams, Moshi can perform ASR and TTS as special cases of the same model. If the text is generated first and the audio follows, the model is in TTS mode; if the audio leads, it is in ASR mode (see the sketch below). Moshi can switch seamlessly between these two modes, so the same model can both generate and recognize speech.
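A toy illustration of the delay trick follows; the tokens and offsets are made up. Shifting one stream relative to the other is the only thing that distinguishes the two modes:

```python
# Sketch of the delay trick: the same joint model acts as TTS or ASR
# depending on which stream is delayed (illustrative offsets and tokens).

def align(text_tokens, audio_tokens, text_delay, audio_delay):
    """Pad each stream so that one leads the other by the chosen delay."""
    pad = "<pad>"
    return ([pad] * text_delay + list(text_tokens),
            [pad] * audio_delay + list(audio_tokens))

text = ["hel", "lo", "!"]
audio = ["a1", "a2", "a3"]

print(align(text, audio, text_delay=0, audio_delay=2))  # text leads -> TTS mode
print(align(text, audio, text_delay=2, audio_delay=0))  # audio leads -> ASR mode
```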
6. Multi-stream modeling
Moshi's architecture allows for the simultaneous processing of multiple audio streams, both to monitor the user's voice and to generate the system's own voice. During a conversation, Moshi can dynamically handle overlapping parts of the audio (such as interruptions, interleaving) without requiring a clear division of speaker turns in advance. This technology makes the conversation more natural.
- Synchronous generation of semantic and acoustic tokens: Moshi uses a parallel semantic and audio token generation mechanism and optimizes the dependencies between these tokens by introducing time delays. By accurately modeling the audio flows of users and systems, Moshi is able to flexibly cope with complex conversation scenarios.
- Dual-stream audio processing: Moshi processes the user's and the system's voice streams simultaneously, achieving full-duplex conversation by modeling two autoregressive audio streams in parallel (one step of this layout is illustrated after the list). This design allows the model to cope with overlapping speech and interruptions in natural conversation.
- Delayed alignment of semantics and audio: By introducing a delay between semantic tokens and audio tokens, the generated speech content is ensured to be both coherent and efficient. The delay can be 1 to 2 frames, depending on the conversation dynamics.
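Concretely, one time step in this multi-stream layout can be pictured as follows. The field names and token values are illustrative, not Moshi's actual tensor layout:

```python
# One 80 ms time step: every stream always has a slot, so overlapping
# speech and interruptions need no special turn-taking logic.
step = {
    "moshi_text":  "tok_42",           # inner-monologue text token
    "moshi_audio": [3, 17, 201, 9],    # Moshi's RVQ levels (illustrative count)
    "user_audio":  [88, 5, 140, 63],   # user's stream, encoded by Mimi
}
print(step)
```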
7. Model training and fine-tuning
- Large-scale pre-training: Moshi's text language model (Helium) has rich language understanding and generation capabilities through pre-training on more than 2.1 trillion English tokens. The model is trained on large-scale text and voice data and can handle a variety of complex conversation scenarios.
- Unsupervised and supervised multi-stage training: Moshi first pre-trains on large-scale unsupervised speech data, then performs post-training on multi-stream data containing natural conversations, and finally performs instruction fine-tuning to make it perform better in actual conversations.
- Helium pre-training: First, pre-train the Helium text language model on a large-scale text dataset to improve its language understanding and reasoning capabilities.
- Moshi pre-training: train a multi-stream audio model on an unlabeled audio dataset to learn to handle speech generation and semantic understanding.
- Multi-stream fine-tuning: Use the Fisher dataset (containing two-channel voice dialogue data) to fine-tune the model and improve its ability to handle multi-stream voice input.
- Instruction fine-tuning: finally, fine-tune on generated instruction-following dialogue data to improve the model's performance in natural dialogue scenarios.
- Data augmentation: during training, Moshi uses augmentation techniques such as adding background noise and simulating user echo, so that the model performs stably across different acoustic environments (a sketch of the noise-mixing step follows).
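As a sketch of the noise-mixing step (the SNR range and recipe are generic assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio matches snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech = rng.normal(size=24_000)   # 1 s of 24 kHz "speech" (stand-in data)
noise = rng.normal(size=24_000)    # background noise (stand-in data)
augmented = add_noise(speech, noise, snr_db=rng.uniform(5, 20))
```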
Performance Evaluation of Moshi
1. Quality and consistency of speech generation
- Speech clarity: Moshi performs well in speech generation; experiments show it can produce high-quality, easily intelligible speech. It maintains coherence during generation, especially over long conversations, an important indicator for dialogue models in complex contexts.
- Naturalness and consistency of speech: By using the Mimi neural audio codec, Moshi can generate high-fidelity speech and maintain the consistency of the system's speech. In addition, the model is able to generate appropriate emotional intonation based on different conversation contexts, improving the naturalness of the user experience.
2. Real-time response performance
- Low latency: Moshi's latency is theoretically 160 milliseconds, and in actual testing it is about 200 milliseconds. This means that Moshi can respond to user input in near real time, significantly improving the smoothness of interaction and the user's conversation experience.
- Full-duplex communication capability: Moshi demonstrated its ability to simultaneously receive and generate speech in testing. This full-duplex feature enables it to handle overlapping speech and interruptions in conversation, showing a response speed close to that of natural human conversation.
3. Speech Recognition and Dialogue Understanding
- Automatic Speech Recognition (ASR): Through the Inner Monologue Method, Moshi combines text and speech streams to significantly improve the accuracy of speech recognition. The model not only captures the user's voice input, but also enhances the system's response accuracy by generating text predictions first.
- Dialogue understanding and reasoning capabilities: Moshi uses the Helium language model for text understanding and reasoning, which enables it to perform well in handling complex questions, open-ended dialogues, and knowledge question answering. Experimental results show that Moshi can effectively understand the context and provide reasonable answers.
4. Robustness of multi-stream voice processing
- Overlapping speech handling: Moshi was able to handle complex conversation scenarios in the evaluation, such as overlapping conversations with multiple speech streams. This is very important for multitasking in real-world applications, as natural conversations often involve interruptions and overlapping speech.
- Multi-context conversation processing: Moshi is trained on multiple streams of data and is able to perform well in different conversation scenarios, whether it is a single user's voice stream or a conversation with multiple users at the same time.
5. Question Answering and Knowledge Acquisition
- Moshi outperforms other current voice dialogue systems in question-answering and knowledge acquisition tasks. With powerful text understanding capabilities and real-time speech generation capabilities, Moshi can handle multiple rounds of question-answering and accurately extract and respond to user questions.
- Linguistic Reasoning and Common-sense Question Answering: The model is capable of handling complex reasoning tasks and performs well on various standard evaluations in natural language processing (NLP), such as common-sense question answering, reading comprehension, and open-ended question answering.
6. Voice emotion and personalized generation
- Emotional Speech Generation: Moshi demonstrated its ability to generate emotional speech in evaluation. It is able to generate speech output with different emotions, such as anger, happiness, or sadness, depending on the context of the conversation.
- Personalized voice style: Through instruction fine-tuning during training, Moshi can generate voices of different styles or specific roles according to user requirements. This personalized capability makes its performance more diverse in specific conversation scenarios.
7. Safety and reliability
- Safe Conversation Evaluation: Moshi demonstrates good safety when handling conversations containing sensitive or inappropriate content. It is able to effectively identify and avoid generating inappropriate content, ensuring the safety and ethical nature of the conversations.
- Robustness and adaptation to noisy environments: In evaluations in noisy and complex environments, Moshi demonstrated good robustness. Through data augmentation techniques (such as noise addition and echo processing), the model is able to cope with different speech environments and ensure high-quality output in noisy environments.
8. Comprehensive test results
Moshi's comprehensive performance tests show that it has achieved leading results in speech generation, dialogue understanding, real-time response, and complex dialogue processing. In particular, Moshi's performance far exceeds that of traditional dialogue systems in handling overlapping dialogues, voice interruptions, and emotion generation.
Technical report: https://kyutai.org/Moshi.pdf
Model download: https://huggingface.co/collections/kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd
Try online: https://moshi.chat/