The Institute for Natural Language Processing (IMS) at the University of Stuttgart has developed ToucanTTS, a full-featured text-to-speech toolkit designed for teaching, training, and using state-of-the-art speech synthesis models. It is currently the most multilingual TTS system available, supporting speech synthesis in more than 7,000 languages, and it offers multi-speaker synthesis that can reproduce the rhythm, stress, and intonation of different speakers.
ToucanTTS provides interactive demos of various applications, including voice design, style cloning, multilingual speech synthesis, and human-edited poetry reading, demonstrating its versatility and powerful performance.
The toolkit is built on the FastSpeech 2 architecture with several improvements, such as a normalizing-flow-based PostNet adopted from PortaSpeech, which helps ensure natural, high-quality synthesis. ToucanTTS also includes a self-contained aligner trained with connectionist temporal classification (CTC) and spectrogram reconstruction, which serves multiple purposes in the toolkit.
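A core idea of the FastSpeech 2 architecture mentioned above is length regulation: each phoneme's encoder feature is repeated according to a predicted duration so the sequence matches the length of the output spectrogram. The following is a minimal, self-contained sketch of that step in plain Python; it is illustrative only and not code from the ToucanTTS toolkit (in the real model, the features are tensors and durations come from a learned duration predictor).

```python
def length_regulate(phoneme_feats, durations):
    # FastSpeech-2-style length regulation: repeat each phoneme's
    # encoder feature by its predicted duration (in spectrogram frames)
    # so the upsampled sequence aligns with the acoustic output.
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames

# Toy example: 4 phoneme features with per-phoneme frame counts.
frames = length_regulate(["h", "e", "l", "o"], [2, 3, 1, 4])
print(len(frames))  # 10 frames total
```

Because durations are explicit, speaking rate can later be changed simply by scaling them, which is what makes this family of models easy to control.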
Main Features of ToucanTTS
Multi-language support
ToucanTTS supports almost all languages in the ISO 639-3 standard, which means it can theoretically handle more than 7,000 languages, more than any other current TTS model. This makes it applicable worldwide and suitable for users of many different language backgrounds. Through a built-in language embedding model, it can switch seamlessly between languages to achieve multilingual synthesis.
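The language embedding mechanism described above amounts to conditioning the model on a per-language vector looked up by its ISO 639-3 code. The sketch below illustrates the lookup idea only; the names, dimensions, and random initialization are assumptions for illustration, and in the real toolkit these embeddings are learned parameters of the acoustic model, not random vectors.

```python
import random

random.seed(0)
EMBED_DIM = 16  # hypothetical embedding size, for illustration only
_language_embeddings = {}

def get_language_embedding(iso_code):
    # Look up (or lazily create) a fixed vector per ISO 639-3 code.
    # A trained model would learn these jointly with the synthesizer,
    # which is what allows seamless switching between languages.
    if iso_code not in _language_embeddings:
        _language_embeddings[iso_code] = [random.random() for _ in range(EMBED_DIM)]
    return _language_embeddings[iso_code]

eng = get_language_embedding("eng")  # English
deu = get_language_embedding("deu")  # German
```

Switching the synthesis language is then just a matter of passing a different code to the lookup, with no change to the rest of the pipeline.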
Multi-speaker speech synthesis
The toolkit supports multi-speaker speech synthesis and can imitate the rhythm, stress, and intonation of different speakers. This is very useful for applications that require stylistic diversity and voice customization.
Controllable speech synthesis
The toolkit lets users control several parameters of the speech, including pitch, speaking rate, and emotion. With this control, output can be generated in different emotional states or speaking styles.
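In duration-based models like this one, such controls are typically exposed as multiplicative scales on the predicted prosody values. The snippet below is a hypothetical sketch of that pattern, not the toolkit's actual API: the class and parameter names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ProsodyControls:
    # Hypothetical control knobs; multipliers on predicted values (1.0 = unchanged).
    pitch_scale: float = 1.0
    duration_scale: float = 1.0  # < 1.0 speeds speech up, > 1.0 slows it down
    energy_scale: float = 1.0

def apply_duration_scale(durations, controls):
    # Scale the predicted per-phoneme frame counts, rounding to whole
    # frames and keeping at least one frame per phoneme.
    return [max(1, round(d * controls.duration_scale)) for d in durations]

slow = ProsodyControls(duration_scale=1.5)
print(apply_duration_scale([2, 4, 6], slow))  # [3, 6, 9]
```

Pitch and energy can be scaled the same way before vocoding, which is what makes this style of control cheap: no retraining is needed, only a rescaling of predicted values at inference time.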
High-quality speech generation
Built on the PyTorch framework, IMS Toucan uses state-of-the-art deep learning techniques to ensure high fidelity and naturalness of the generated speech. The model supports end-to-end training and inference and can handle complex speech synthesis tasks.
Human editing
ToucanTTS includes human-in-the-loop editing capabilities, which are particularly useful for literary studies and poetry reading tasks. Users can adjust the synthesized speech to their own needs and preferences.
Self-contained aligner
The toolkit ships with an aligner trained using connectionist temporal classification (CTC) and spectrogram reconstruction, usable for a variety of purposes. This improves the accuracy and quality of the synthesized speech.
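To make the CTC idea concrete: a CTC-trained model emits a label (or a special blank) for every spectrogram frame, and the decoding rule collapses consecutive repeats and removes blanks to recover the phoneme sequence, which is what yields the alignment between frames and phonemes. A minimal illustration of that collapse rule (not code from the toolkit):

```python
def ctc_collapse(frame_labels, blank="-"):
    # CTC decoding rule: merge runs of identical labels, then drop blanks.
    # The positions where labels change give the frame-to-phoneme alignment.
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame labels for ten spectrogram frames of the word "hello":
print(ctc_collapse(["-", "h", "h", "-", "e", "l", "l", "-", "l", "o"]))
# ['h', 'e', 'l', 'l', 'o']
```

Note how the blank between the two "l" runs is what lets CTC represent a genuinely doubled phoneme rather than one long one.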
Data preprocessing tools
The toolkit provides a complete set of data preprocessing tools, including text cleaning and feature extraction, to simplify the preparation of training data.
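As a rough idea of what the text-cleaning stage of such preprocessing involves, here is a minimal sketch using only the Python standard library. It is an assumption-laden illustration, not the toolkit's actual cleaning code, which handles far more (phonemization, language-specific rules, number expansion, and so on).

```python
import re
import unicodedata

def clean_text(text):
    # Minimal cleaning sketch: normalize Unicode forms, lowercase,
    # strip characters outside letters/digits/basic punctuation,
    # and collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"[^\w\s'.,!?-]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("  Hello,\tWORLD!!  "))  # hello, world!!
```

Consistent cleaning like this matters because the aligner and the synthesizer must see the same symbol inventory at training and inference time.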
- Author: KCGOD
- URL: https://kcgod.com/toucantts
- Copyright: All articles in this blog, unless otherwise stated, are released under a BY-NC-SA license. Please credit the source!