LSLM (Listening-while-Speaking Language Model) is a new language model design jointly developed by the X-LANCE Artificial Intelligence Laboratory of Shanghai Jiao Tong University and ByteDance. It can listen in real time while it speaks: much as a person can keep listening while talking, the model continues to take in what you say, and responds to it, even while it is generating its own speech.
LSLM achieves full-duplex modeling, meaning it can hear external sounds while speaking, which supports real-time voice interaction.
It is an end-to-end system built around two main channels:
- A token-based decoder-only TTS (text-to-speech) model: for generating speech.
- A streaming self-supervised learning (SSL) encoder: for processing audio input in real time.
What Problems Does LSLM Solve?
Real-time interaction:
- Most existing speech language models are turn-based: they cannot listen to and process new input while generating speech, which limits their use in real-time conversation.
- LSLM supports full-duplex modeling: it keeps taking in external audio while it is speaking, enabling real-time voice interaction.
Handling interrupts:
- In real conversations, people often interrupt each other, and existing models lack the ability to handle such interruptions. While a model is generating speech, if the user is dissatisfied or wants to cut in, existing models cannot respond and adjust in time.
- By fusing listening and speaking signals, LSLM can immediately stop the current speech generation when an interruption signal is detected, thereby improving the naturalness and flexibility of the interaction.
Noise robustness and command sensitivity:
- Real-life interaction scenarios often contain background noise, and existing models handle noisy input poorly.
- LSLM takes noisy environments into account in its design, and experiments verify that it can still accurately recognize and respond to user commands under noisy conditions, improving its robustness and adaptability.
Through these improvements, LSLM significantly boosts the performance of voice dialogue systems in practical applications, making human-computer interaction more natural and smooth and suitable for a wider range of real-world scenarios.
Key Features of LSLM
Real-time speech generation:
LSLM can generate speech instantly during a conversation, much like a person responding while you are still talking, rather than waiting for the other party to finish before replying.
Real-time voice input processing:
The model can hear what you say while it is talking. Even if you interrupt it mid-utterance, it can still hear you and respond.
Signal Fusion:
To better combine what is heard with what is being said, LSLM explores several different methods of fusing the two streams of information so that the conversation flows more smoothly.
Interrupt detection and handling:
LSLM can recognize an interruption signal while it is speaking; when you interrupt it, it stops speaking immediately and waits for your instructions.
Noise processing:
It can work in noisy environments: even with background noise, it can still clearly hear your commands and respond correctly.
Multi-instruction processing:
LSLM is able to understand and respond to a variety of different instructions, not just limited to specific commands, which makes it more flexible and practical in actual use.
LSLM Technical Approach
LSLM (Listening-while-Speaking Language Model) achieves its functions through the following technical methods:
Model Architecture
1. Token-based decoder-only TTS (text-to-speech)
- Simulates the ability to speak by generating speech tokens step by step.
- Speech generation: LSLM uses a token-based, decoder-only TTS model to generate speech. The model converts text into speech tokens and then produces continuous speech output through the decoder.
- Autoregressive modeling: each speech token is generated conditioned on the previously generated tokens and the current context, which keeps the speech coherent.
2. Streaming Self-Supervised Learning (SSL) Encoder
- Simulates the ability to listen and process audio input.
- Audio input processing: LSLM uses a streaming SSL encoder to process real-time audio input. This encoder can continuously receive and encode audio signals and convert them into continuous embedding vectors for use by the model.
- Real-time feature extraction: Since the encoder is streaming, it can extract features in real time as speech arrives, ensuring that the model can respond in a timely manner. A combined sketch of both channels follows.
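To make the two channels concrete, here is a minimal, self-contained sketch in Python. Everything in it (the stand-in functions, the vocabulary size, the chunk size) is a hypothetical placeholder for illustration; LSLM's actual components are the neural networks described in the paper.

```python
import numpy as np

VOCAB_SIZE = 1024   # size of the speech-token vocabulary (assumed)
EOS_TOKEN = 0       # end-of-utterance token id (assumed)
CHUNK = 320         # e.g. 20 ms of 16 kHz audio per chunk (assumed)
EMBED_DIM = 256     # listening-channel embedding size (assumed)

def ssl_encode_chunk(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the streaming SSL encoder: one embedding per audio chunk."""
    rng = np.random.default_rng(0)
    return chunk @ rng.normal(size=(CHUNK, EMBED_DIM))

def tts_decoder_logits(tokens: list[int]) -> np.ndarray:
    """Stand-in for the decoder-only TTS: token history -> next-token logits."""
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=VOCAB_SIZE)

# Speaking channel: autoregressive generation, one token at a time.
tokens: list[int] = [5, 17, 42]  # fake "prompt" tokens
for _ in range(20):
    next_token = int(np.argmax(tts_decoder_logits(tokens)))  # greedy decode
    tokens.append(next_token)
    if next_token == EOS_TOKEN:  # stop at end of utterance
        break

# Listening channel: embeddings become available chunk by chunk, in real time.
audio = np.random.default_rng(1).normal(size=16000)  # 1 s of fake audio
embeddings = [ssl_encode_chunk(audio[i:i + CHUNK])
              for i in range(0, len(audio) - CHUNK + 1, CHUNK)]
print(len(tokens), len(embeddings))
```

The point of the split is that the listening side produces an embedding the moment each chunk arrives, so the speaking side can condition on it mid-generation instead of waiting for a full utterance.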
Signal fusion strategy
LSLM explores three different signal fusion strategies to optimize the integration of listening and speaking (a schematic sketch follows this list):
- Early Fusion: the heard audio signal is fused with the speech signal to be generated before the model starts processing, in the early stage of generation. It is like deciding what to say while you are still listening.
- Middle Fusion: in each Transformer block, the heard audio signal is fused with the speech signal being generated, so the two streams are combined repeatedly as the model processes the data. It is like listening and speaking at the same time, continuously adjusting what you say.
- Late Fusion: the heard audio signal is fused with the generated speech signal just before the final output. It is like preparing everything you want to say first, then making final adjustments based on what you heard.
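The difference between the three strategies is purely where the listening embedding enters the stack. The toy Python sketch below uses random matrices in place of real Transformer blocks; the names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

DIM = 256
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(DIM, DIM)) * 0.01 for _ in range(4)]  # toy "layers"

def run_block(weight: np.ndarray, x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ weight)  # stand-in for one Transformer block

def early_fusion(listen_emb, speak_emb):
    x = listen_emb + speak_emb            # fuse once, before the stack
    for w in blocks:
        x = run_block(w, x)
    return x

def middle_fusion(listen_emb, speak_emb):
    x = speak_emb
    for w in blocks:
        x = run_block(w, x + listen_emb)  # re-inject listening at every layer
    return x

def late_fusion(listen_emb, speak_emb):
    x = speak_emb
    for w in blocks:
        x = run_block(w, x)
    return x + listen_emb                 # fuse once, just before the output

listen_emb = rng.normal(size=DIM)
speak_emb = rng.normal(size=DIM)
print(middle_fusion(listen_emb, speak_emb).shape)  # (256,)
```

Middle fusion costs a little extra work per layer but lets every layer see the listening channel, which matches the finding reported below that it best balances generation quality and real-time interaction.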
LSLM achieves real-time conversation through full-duplex modeling (the formula after this list makes the conditioning explicit):
- At each time step t, the model uses both the history of the speaking channel and real-time information from the listening channel to predict the next token.
- When an interrupt signal is detected, the model terminates its current output within a short detection interval after the interruption begins.
- The model uses the streaming SSL encoder for real-time audio feature extraction, ensuring it can react to what it hears during the generation process.
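Schematically, and in our own notation rather than the paper's exact formulation, the speaking channel's next-token distribution at step t is conditioned on both channels:

```latex
P\left(s_t \mid s_{<t},\; a_{\le t}\right)
```

where s denotes speaking-channel tokens and a denotes listening-channel embeddings; emitting the interrupt token ends generation early.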
Here are the two test scenarios:
- Command-based: tests how the model handles user commands. For example, when you say "turn on the music" to a voice assistant, the model can generate a spoken response while still hearing the command.
- Voice-based: tests how the model handles continuous voice input. For example, when you talk to a voice assistant, it can generate responses while hearing you speak, without breaking the flow of the conversation.
Interrupt detection and handling
- Interrupt Token (IRQ): LSLM introduces an interrupt token. When an interrupt signal is detected, the model outputs this token and stops the current speech generation.
- Real-time interrupt response: After detecting the interrupt signal, the model can respond promptly and stop generating within a short time (such as 0.5 seconds), then wait for new instructions. A minimal sketch of this loop follows.
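In the sketch below, `fused_decoder`, the token ids, and the embedding stream are all hypothetical stand-ins; the real model is a fused Transformer, not a random projection.

```python
import numpy as np

VOCAB_SIZE = 1024
EOS_TOKEN = 0   # assumed end-of-utterance token id
IRQ_TOKEN = 1   # assumed interrupt token id

def fused_decoder(speak_history: list[int], listen_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the fused model: both channels -> next-token logits."""
    rng = np.random.default_rng(len(speak_history))
    return rng.normal(size=VOCAB_SIZE) + float(listen_emb.mean())

def speak_until_interrupted(listen_stream, max_len: int = 200) -> list[int]:
    tokens: list[int] = []
    for listen_emb in listen_stream:  # one listening embedding per time step
        next_token = int(np.argmax(fused_decoder(tokens, listen_emb)))
        if next_token == IRQ_TOKEN:
            break                     # stop speaking, wait for new instructions
        tokens.append(next_token)
        if next_token == EOS_TOKEN or len(tokens) >= max_len:
            break
    return tokens

# Fake stream of listening embeddings arriving in real time.
stream = (np.random.default_rng(t).normal(size=256) for t in range(50))
print(len(speak_until_interrupted(stream)))
```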
Noise robustness
- Noise processing: During training, various background-noise data are mixed in, so that in real applications the model can still accurately process and respond to user commands in noisy environments. A sketch of this kind of augmentation follows.
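A common way to do this, shown here as an illustrative assumption rather than the paper's exact recipe, is to mix a noise clip into clean speech at a randomly chosen signal-to-noise ratio (SNR):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)  # loop/crop the noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)  # 1 s of fake clean speech at 16 kHz
noise = rng.normal(size=8000)    # a shorter noise clip
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20))  # random SNR
print(noisy.shape)
```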
Multiple instruction processing
- Diversified command training: A variety of different command data sets were used during model training, enabling it to handle various types of commands and voice inputs, improving the flexibility of practical applications.
Experimental Results of LSLM
LSLM conducted a variety of experiments to verify its performance in different scenarios. The following are the key results and findings of the experiments:
1. Experimental Setup
The experiments fall into two main categories: command-based full-duplex modeling (Command-based FDM) and voice-based full-duplex modeling (Voice-based FDM). In both scenarios, the model is tested with the early, middle, and late fusion strategies.
2. Key Findings
- Early Fusion:
- Advantages: Can respond quickly, suitable for simple conversation scenarios.
- Disadvantages: The generated speech may not be natural enough in complex conversations.
- Middle Fusion:
- Advantages: It achieves the best balance between speech generation and real-time interaction. The model can continuously adjust the generated content during the conversation, making the conversation more natural and smooth.
- Disadvantages: Computationally more expensive, though it performs best in most scenarios.
- Late Fusion:
- Advantages: Suitable for use in scenarios that require high precision, and the generated speech is more accurate.
- Disadvantages: Slightly slow to respond and may not be flexible enough in real-time conversations.
3. Experimental Results
- Command-based FDM:
- In command-based scenarios, LSLM can understand and respond to user commands quickly and accurately.
- The middle fusion strategy performs best on complex commands, generating an appropriate spoken response while still listening to the command.
- Experiments show that LSLM achieves high accuracy and stability in command-based conversations.
- Voice-based FDM:
- In voice-based scenarios, LSLM processes continuous speech input and generates responses in real time during the conversation.
- The middle fusion strategy also performs well here, maintaining efficient interaction across diverse voice inputs.
- Experimental results show that LSLM handles complex conversation scenarios well, achieving natural and smooth real-time dialogue.
4. Performance in noisy environments
- Robustness test:
- LSLM also performs well in noisy environments, accurately recognizing speech input and generating responses in the presence of noisy backgrounds.
- The middle fusion strategy performs particularly well in noisy environments, ensuring the continuity and accuracy of conversations.
Project address: https://ziyang.tech/LSLM/