ElevenLabs has launched Scribe, an automatic speech recognition (ASR) model that the company describes as the world's most accurate speech-to-text model. In benchmark tests it is reported to surpass previous leading models such as Gemini 2.0 Flash and OpenAI Whisper v3.
Scribe transcribes speech in 99 languages and is suited to a wide range of real-world audio, including meeting minutes, film subtitles, and song lyrics.
Key Features of Scribe:
- Multilingual Support: Accurately transcribes speech content in 99 languages, significantly reducing recognition errors for low-resource languages such as Serbian, Cantonese, and Malayalam.
- High-Precision Speech-to-Text Conversion:
  - Performs strongly on multiple industry benchmarks (FLEURS and Common Voice).
  - Achieves a 98.7% accuracy rate for Italian and 96.7% for English.
- Advanced Audio Processing Capabilities:
  - Word-level Timestamps: Provides a timestamp for every word, enabling precise subtitle synchronization and audio editing.
  - Speaker Diarization: Identifies and differentiates up to 32 distinct speakers within the same audio recording.
  - Audio-event Tagging: Detects and tags non-verbal elements such as laughter, applause, and background noise, enriching the transcribed content.
- API Support & Seamless Integration:
  - Offers structured JSON output, enabling developers to easily integrate Scribe into their applications or platforms.
  - Currently supports pre-recorded audio and video files. A low-latency real-time version is planned for a future release, to cover live streaming, conferences, and other real-time transcription needs.
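
The word-level timestamps, speaker labels, and audio-event tags listed above map naturally onto subtitle generation. The following is a minimal sketch, not the official client: it assumes the JSON output contains a `words` list whose entries carry `text`, `start`, `end`, `type`, and `speaker_id` fields. Those field names, the sample payload, and the helper names `format_srt_time` and `words_to_srt` are assumptions for illustration and should be checked against the API documentation.

```python
import json

# Hypothetical sample payload. The field names are assumptions for
# illustration; they are not confirmed by this article.
SAMPLE_RESPONSE = json.loads("""
{
  "language_code": "en",
  "text": "Hello there. (laughter) Hi!",
  "words": [
    {"text": "Hello", "start": 0.00, "end": 0.42, "type": "word", "speaker_id": "speaker_0"},
    {"text": "there.", "start": 0.50, "end": 0.95, "type": "word", "speaker_id": "speaker_0"},
    {"text": "(laughter)", "start": 1.10, "end": 1.80, "type": "audio_event", "speaker_id": "speaker_0"},
    {"text": "Hi!", "start": 2.00, "end": 2.30, "type": "word", "speaker_id": "speaker_1"}
  ]
}
""")


def format_srt_time(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict]) -> str:
    """Merge consecutive entries from the same speaker into one SRT cue each."""
    cues: list[dict] = []
    for word in words:
        if word.get("type") == "spacing":
            continue  # skip whitespace-only tokens, if the response contains any
        speaker = word.get("speaker_id", "unknown")
        if cues and cues[-1]["speaker"] == speaker:
            # Same speaker as the previous entry: extend the current cue.
            cues[-1]["tokens"].append(word["text"])
            cues[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first entry): start a new cue.
            cues.append({
                "speaker": speaker,
                "start": word["start"],
                "end": word["end"],
                "tokens": [word["text"]],
            })

    blocks = []
    for index, cue in enumerate(cues, start=1):
        timing = f"{format_srt_time(cue['start'])} --> {format_srt_time(cue['end'])}"
        text = f"[{cue['speaker']}] {' '.join(cue['tokens'])}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(words_to_srt(SAMPLE_RESPONSE["words"]))
```

With the sample payload, this yields two cues, one per speaker turn, and the tagged laughter stays inline in the first cue. A real subtitle tool would also split long turns by duration or character count, which this sketch leaves out.
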
Advantages of Scribe:
- Industry-Leading Accuracy:
  - Scribe outperforms top-tier models, including Gemini 2.0 Flash, Whisper Large v3, and Deepgram Nova-3, on key ASR benchmarks in nearly all languages.
  - Achieves the lowest Word Error Rate (WER) in FLEURS and Common Voice evaluations covering its 99 supported languages.
  - Scribe’s WER is lower than that of Google Gemini 2.0 Flash, OpenAI Whisper v3, and Deepgram Nova-3, with particularly strong results in languages such as Italian (WER 1.3%) and English (WER 3.3%).
- Low-Resource Language Optimization:
  - Achieves significant improvements in languages traditionally challenging for models (e.g., Serbian, Malayalam), with substantial reductions in WER.
- Complex Scene Adaptability:
  - Maintains high accuracy in noisy environments or multi-speaker scenarios, making it suitable for diverse real-world applications.
- Feature-Rich Functionality:
  - Offers speaker diarization, timestamps, and non-speech event detection, surpassing the basic transcription capabilities of many competitors.
- Competitive Pricing:
  - Priced at $0.40 per hour of audio, with a discounted rate of $0.20 per hour for the first six weeks after launch, offering compelling value compared to similar services on the market.
- Ease of Integration:
  - Accessible through dashboard file uploads or API calls, catering to a variety of user needs.
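
The accuracy and WER figures quoted in this article are two views of the same measurement: WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words, and the accuracy percentages correspond to 100% minus WER (98.7% for Italian against a WER of 1.3%). The following is a minimal sketch of that standard calculation. The function name `word_error_rate` is illustrative, and the lowercase-and-split normalization is an assumption, since the article does not describe how the benchmark texts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")
```

One substituted word in a six-word reference gives a WER of 1/6, roughly 16.7%. Because insertions count as errors, WER can exceed 100% on very poor output, so the "accuracy = 100% minus WER" reading only makes sense for low error rates like the ones reported here.
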
API Documentation: [API Documentation](API Documentation)
Online Experience: https://elevenlabs.io/speech-to-text
Official Introduction: https://elevenlabs.io/blog/meet-scribe
- Author: KCGOD
- URL: https://kcgod.com/scribe
- Copyright: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting.