Seed-Music is a music generation model developed by ByteDance. Users can generate music from multimodal inputs (such as text descriptions, reference audio, music scores, and voice prompts), and the system provides convenient post-editing functions, such as modifying the lyrics or melody.
Seed-Music combines an autoregressive language model with a diffusion model, providing precise control over the generated music while maintaining its quality.
Seed-Music also lets users upload short voice clips, which the system converts into complete songs.
In addition to vocal and instrumental music generation, Seed-Music supports singing voice synthesis, singing voice conversion, music editing, and other functions, making it suitable for different user groups.
Key Features of Seed-Music
High-quality music generation
Supports the generation of both vocal and instrumental works. Users can provide input via text, audio, and other methods to achieve diversified music creation.
Controlled Music Generation
Provides fine-grained music control, allowing users to generate music that meets their requirements based on lyrics, style descriptions, reference audio, sheet music, etc.
- Multimodal input: Seed-Music supports multiple input methods, such as lyrics, music style descriptions, reference audio, sheet music, and voice prompts, to achieve fine-grained control.
- Style control: Users can specify the style, rhythm, melody, and other attributes of the music through text or audio references to generate works that meet their needs (a minimal sketch of such a request follows this list).
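The controls listed above could be bundled into a single request. The sketch below is a hypothetical Python illustration, not Seed-Music's actual API (none has been published as a library); it only shows how lyrics, a style description, reference audio, sheet music, and a voice prompt might coexist as optional fields of one generation call.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerationRequest:
    """Hypothetical bundle of control inputs for one generation call."""
    lyrics: Optional[str] = None             # full or partial lyrics
    style_description: Optional[str] = None  # free-text style/genre/mood prompt
    reference_audio: Optional[str] = None    # path to an audio file used as a style reference
    sheet_music: Optional[str] = None        # e.g. a MIDI or MusicXML path
    voice_prompt: Optional[str] = None       # short voice clip to base the vocals on

    def active_controls(self) -> List[str]:
        """List which control signals are present, for logging or validation."""
        return [name for name, value in vars(self).items() if value is not None]

request = GenerationRequest(
    lyrics="City lights are calling my name...",
    style_description="upbeat synth-pop, 120 BPM, female vocal",
)
print(request.active_controls())  # ['lyrics', 'style_description']
```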
Vocal synthesis and conversion
- Singing Voice Synthesis: Generate natural and expressive singing voices in multiple languages.
- Zero-shot singing voice conversion: A voice or singing recording as short as 10 seconds can be converted into music of different styles (a toy sketch of the zero-shot conditioning idea follows this list).
- Lyrics2Song: Convert input lyrics into vocal music with accompaniment, supporting both short- and long-form generation.
- Audio prompting and style transfer: Supports audio continuation and style transfer, generating new music in a similar style from existing audio.
- Instrumental Music Generation: Generate high-quality, purely instrumental music, suitable for scenarios without lyrics.
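To make the zero-shot idea concrete, the toy sketch below derives a fixed-size conditioning vector from a roughly 10-second reference clip. Everything here (the frame pooling, the random projection, the 64-dimensional output) is an illustrative assumption rather than Seed-Music's actual encoder; the point is only that conditioning on a short clip requires no per-singer training.

```python
import numpy as np

def timbre_embedding(reference_audio: np.ndarray, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a timbre encoder: pool a short reference clip into a
    fixed-size conditioning vector. Real systems use a learned encoder; this
    only illustrates the zero-shot interface (no per-singer training)."""
    # Frame the signal and pool simple per-frame statistics.
    frames = reference_audio[: reference_audio.size // 256 * 256].reshape(-1, 256)
    stats = np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
    # Project the pooled statistics to a fixed dimension (random projection here).
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((stats.size, dim)) / np.sqrt(stats.size)
    return stats @ projection

# ~10 seconds of dummy audio at 44.1 kHz stands in for the user's recording.
clip = np.random.default_rng(1).standard_normal(10 * 44_100)
print(timbre_embedding(clip).shape)  # (64,)
```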
Music post-editing
Supports modification of lyrics and melody, allowing users to edit and adjust the generated audio directly.
- Lyrics and melody editing: Seed-Music provides interactive tools that let users edit lyrics and melody directly in the generated audio, making later adjustments easier.
- Music mixing and arrangement: The system not only generates complete songs but also supports modifying them afterwards, such as adjusting instrument parts and mixing effects (a hypothetical edit descriptor is sketched after this list).
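As a rough illustration of what an editing request might look like, the hypothetical descriptor below expresses one lyric edit as "replace the words sung in this time range and regenerate only that span". The class and field names are assumptions for illustration, not Seed-Music's published interface.

```python
from dataclasses import dataclass

@dataclass
class LyricEdit:
    """Hypothetical description of one post-edit: replace the lyric sung in a
    time range of an already generated track and regenerate only that span."""
    track_id: str
    start_seconds: float
    end_seconds: float
    new_lyric: str

edit = LyricEdit(track_id="demo-001", start_seconds=12.0, end_seconds=16.5,
                 new_lyric="under neon skies we run")
print(edit)
```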
Multi-style and multi-language support
Seed-Music can generate works covering a variety of music styles (such as pop, classical, jazz, and electronic) and supports multilingual singing generation, making it suitable for users worldwide.
Real-time generation and streaming support
Supports real-time music generation and streaming output to improve user interactivity and creative efficiency.
Architecture of Seed-Music
The architecture of Seed-Music consists of three modules: a representation learning module, a generation module, and a rendering module. These modules work together to generate high-quality music from multimodal inputs (such as text, audio, and sheet music).
Representation Learning Module
Compresses the raw audio signal into three classes of intermediate representations (audio tokens, symbolic music tokens, and vocoder latents), each suited to different music generation and editing tasks.
Generation module
Generates the corresponding music representation from the user's multimodal input, using an autoregressive language model and a diffusion model.
Rendering module
Converts the generated intermediate representation into high-quality audio waveforms, using a diffusion model and a vocoder to render the final audio output.
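The three-module flow can be pictured as the toy pipeline below. The class names and the trivial signal processing inside them are assumptions made for illustration; what matches the description above is only the ordering (representation learning → generation → rendering).

```python
import numpy as np

class RepresentationModule:
    """Maps raw audio (or other references) into an intermediate representation."""
    def encode(self, audio: np.ndarray) -> np.ndarray:
        frames = audio[: audio.size // 256 * 256].reshape(-1, 256)
        return frames.mean(axis=1)                       # toy "tokenization"

class GenerationModule:
    """Produces new intermediate representations from prompts and controls."""
    def generate(self, prompt_tokens: np.ndarray, steps: int = 128) -> np.ndarray:
        rng = np.random.default_rng(0)
        return np.concatenate([prompt_tokens, rng.standard_normal(steps)])

class RenderingModule:
    """Turns intermediate representations back into an audio waveform."""
    def render(self, tokens: np.ndarray, hop: int = 256) -> np.ndarray:
        return np.repeat(tokens, hop)                    # toy "vocoding"

reference = np.random.default_rng(1).standard_normal(44_100)   # ~1 s of dummy audio
tokens = RepresentationModule().encode(reference)
generated = GenerationModule().generate(tokens)
waveform = RenderingModule().render(generated)
print(tokens.shape, generated.shape, waveform.shape)
```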
Technical Methods of Seed-Music
Seed-Music uses a variety of generation technologies to ensure that the system can flexibly respond to different music generation and editing needs:
Auto-Regressive Language Model
Generates audio tokens step by step from user input (such as lyrics, a style description, or an audio reference). This approach suits generation tasks with strong context dependence, such as lyrics-conditioned generation and style control: much like writing a song word by word from a set of lyrics, it gives fine control over rhythm, melody, and the alignment between lyrics and music.
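A minimal sketch of what "step by step" means for an autoregressive model: at each step the model scores all candidate tokens given the context so far, one token is sampled, and it is appended to the context. The toy scorer below is random and merely stands in for the trained language model; the vocabulary size and prompt are arbitrary.

```python
import numpy as np

def sample_tokens_autoregressively(logits_fn, prompt, steps=16, temperature=1.0, seed=0):
    """Generic next-token sampling loop: score all candidates given the context
    so far, sample one token, append it, and repeat."""
    rng = np.random.default_rng(seed)
    context = list(prompt)
    for _ in range(steps):
        logits = logits_fn(context) / temperature
        probs = np.exp(logits - logits.max())   # softmax over the vocabulary
        probs /= probs.sum()
        context.append(int(rng.choice(len(probs), p=probs)))
    return context

# Toy "model": 512-token vocabulary, scores depend only on the last token seen.
vocab_size = 512
toy_model = lambda ctx: np.random.default_rng(ctx[-1]).standard_normal(vocab_size)

print(sample_tokens_autoregressively(toy_model, prompt=[1, 2, 3]))
```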
Diffusion Model
Suitable for complex music generation and editing tasks: it produces clean music representations through gradual denoising. Diffusion models are well suited to tasks that require multi-step refinement and high fidelity, such as fine-grained audio editing; they gradually "polish" noisy audio into clear music, which makes them a good fit for post-editing or adjusting musical details.
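The sketch below shows the general shape of such a reverse (denoising) loop: start from noise and repeatedly apply a denoising step. The toy step used here simply pulls the sample toward a fixed target; a real diffusion model predicts the denoising direction with a trained network and a noise schedule.

```python
import numpy as np

def denoise(noisy, denoise_step, num_steps=50):
    """Schematic reverse-diffusion loop: repeatedly apply a denoising step,
    starting from noise. `denoise_step` stands in for the trained network."""
    x = noisy
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
    return x

# Toy denoiser that moves the sample a fraction of the way toward a fixed
# "clean" target at every step (real models are learned, not hard-coded).
target = np.linspace(-1.0, 1.0, 1_024)            # stand-in for a clean latent
toy_step = lambda x, t: x + 0.1 * (target - x)

noise = np.random.default_rng(0).standard_normal(1_024)
restored = denoise(noise, toy_step)
print(float(np.abs(restored - target).mean()))    # residual error is tiny
```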
Vocoder
Responsible for converting the generated representation into the final high-quality audio, much like translating "music code" into a sound file that can be played directly. Built on variational autoencoder (VAE) technology, the vocoder can produce 44.1 kHz high-fidelity stereo output.
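As a shape-level illustration only, the toy decoder below upsamples a sequence of latent frames into a 44.1 kHz stereo waveform. The hop size, latent dimensionality, and linear interpolation are assumptions; a real vocoder/VAE decoder is a learned neural network.

```python
import numpy as np

def toy_vocoder(latents: np.ndarray, hop: int = 512) -> np.ndarray:
    """Toy stand-in for a vocoder/VAE decoder: upsample latent frames into a
    stereo waveform. Only the input/output shapes are meant to be instructive."""
    num_frames = latents.shape[0]
    mono = np.interp(
        np.arange(num_frames * hop) / hop,    # output time axis, in frame units
        np.arange(num_frames),                # latent frame indices
        latents[:, 0],                        # use the first latent channel
    )
    return np.stack([mono, mono], axis=1)     # duplicate to 2 channels (stereo)

latents = np.random.default_rng(0).standard_normal((86, 8))  # ~1 s at 512-sample hops
audio = toy_vocoder(latents)
print(audio.shape)  # (44032, 2): ~1 second of stereo audio at 44.1 kHz
```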
Intermediate Representation
Seed-Music uses three different intermediate representations for different generation tasks (a hypothetical task-to-representation routing is sketched after the list):
- Audio Tokens: Encode musical features such as melody, rhythm, and harmony; suitable for autoregressive models and for generating specific music clips.
- Symbolic Music Tokens: Like sheet music (e.g., MIDI), they represent melody and chords in a readable, editable form, suitable for score generation and editing tasks.
- Vocoder Latents: Capture finer acoustic detail and are used with the diffusion model, suitable for fine-grained editing and for generating complex musical works.
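One way to picture how the three representations divide the work is a simple routing rule by task type, as sketched below. The task names and the routing itself are illustrative assumptions based on the descriptions above, not an official mapping.

```python
from enum import Enum

class IntermediateRepresentation(Enum):
    """The three representation families described above, labeled with the
    generator or use each one is paired with in this post's summary."""
    AUDIO_TOKENS = "autoregressive language model"
    SYMBOLIC_TOKENS = "score-level generation and editing"
    VOCODER_LATENTS = "diffusion model"

def pick_representation(task: str) -> IntermediateRepresentation:
    """Hypothetical routing rule: choose a representation family by task type."""
    if task in {"lyrics2song", "audio_continuation", "style_transfer"}:
        return IntermediateRepresentation.AUDIO_TOKENS
    if task in {"score_generation", "score_editing"}:
        return IntermediateRepresentation.SYMBOLIC_TOKENS
    return IntermediateRepresentation.VOCODER_LATENTS   # e.g. fine-grained audio editing

print(pick_representation("score_editing").value)
```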
Training and Inference of Seed-Music
Seed-Music's model training is divided into three stages: pre-training, fine-tuning, and post-training:
- Pre-training: Establish basic capabilities for generating music by pre-training models with large-scale music data.
- Fine-tuning: Fine-tune the model through specific tasks or data to improve the performance of the model in specific generation tasks, such as improving musicality and generation accuracy.
- Post-training (reinforcement learning): Optimize the controllability and musical quality of the generated results through reinforcement learning, using reward models such as the match between lyrics and audio and the consistency of musical structure to improve output quality (a toy reward combination is sketched after this list).
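The toy function below shows the simplest possible way such reward signals could be combined into a single scalar for reinforcement learning. The two signals named are the ones mentioned above; the linear weighting and the [0, 1] score range are assumptions for illustration.

```python
def combined_reward(lyric_audio_match: float, structure_consistency: float,
                    weights: tuple = (0.5, 0.5)) -> float:
    """Toy scalar reward: a weighted sum of two reward-model scores, assumed to
    lie in [0, 1]. The weighting scheme is illustrative, not from the paper."""
    w_match, w_structure = weights
    return w_match * lyric_audio_match + w_structure * structure_consistency

# Example: one candidate generation scored by two reward models.
print(combined_reward(lyric_audio_match=0.82, structure_consistency=0.67))  # 0.745
```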
During inference, Seed-Music uses streaming generation, which lets users experience the generation process in real time and provide feedback and adjustments based on the content generated so far.
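From the client's perspective, streaming inference looks roughly like the sketch below: audio arrives in short chunks as soon as each one is ready, so playback and feedback can begin before the full track is finished. The chunk size, simulated latency, and placeholder payloads are assumptions, not measurements of Seed-Music.

```python
import time
from typing import Iterator

def stream_music_chunks(total_seconds: float, chunk_seconds: float = 1.0) -> Iterator[bytes]:
    """Schematic streaming loop: yield audio in short chunks as each is generated
    instead of waiting for the full track. Payloads here are placeholders; a real
    system would yield decoded PCM or compressed audio frames."""
    produced = 0.0
    while produced < total_seconds:
        time.sleep(0.01)            # stand-in for model + vocoder latency per chunk
        produced += chunk_seconds
        yield b"\x00" * 4_096       # placeholder audio chunk

# A client can start playback (or send feedback) after the first chunk arrives.
for i, chunk in enumerate(stream_music_chunks(total_seconds=3.0)):
    print(f"received chunk {i} ({len(chunk)} bytes)")
```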
Project address and case display: https://team.doubao.com/en/special/seed-music
Technical report: https://arxiv.org/pdf/2409.09214