Callab STT Benchmark
Telephony condition (8 kHz μ-law) • 100 samples/language • A100 40GB • 2026-05-04
Overall WER
Whisper v3 Turbo: 8.1% • 0.8B params • Batch (faster-whisper) • 6 langs
Parakeet TDT v3: 5.6% • 0.6B params • Batch (NeMo) • 5 langs
Cohere Transcribe: 4.6% • 2B params • Batch (Transformers) • 6 langs
Saudi ASR v2: 18.7% • 1.1B params • Batch (NeMo, fine-tuned) • 6 langs
WER by Language
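Language   | Whisper v3 Turbo | Parakeet TDT v3 | Cohere Transcribe | Saudi ASR v2
English    | 10.2%            | 9.2%            | 7.0%              | 17.6%
French     | 8.7%             | 6.4%            | 5.2%              | 8.1%
Arabic     | 13.1%            | N/A             | 6.5%              | 8.6%
Spanish    | 6.3%             | 3.9%            | 2.9%              | 19.3%
Portuguese | 4.7%             | 4.9%            | 3.4%              | 17.0%
Italian    | 5.4%             | 3.4%            | 2.7%              | 41.5%
Overall    | 8.1%             | 5.6%*           | 4.6%              | 18.7%
* Parakeet TDT v3 has no Arabic result, so its Overall figure covers only the other five languages.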
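Every Overall figure in the table matches an unweighted mean of that model's per-language WERs (five languages for Parakeet TDT v3). The sketch below recomputes the row from the table values, assuming that averaging rule; `WER_BY_LANG` and `overall_wer` are illustrative names, not the benchmark's code.

```python
from statistics import mean

# Per-language WER (%) copied from the table; None marks a language with no
# result (Parakeet TDT v3 has no Arabic entry).
LANGS = ["English", "French", "Arabic", "Spanish", "Portuguese", "Italian"]
WER_BY_LANG = {
    "Whisper v3 Turbo": [10.2, 8.7, 13.1, 6.3, 4.7, 5.4],
    "Parakeet TDT v3": [9.2, 6.4, None, 3.9, 4.9, 3.4],
    "Cohere Transcribe": [7.0, 5.2, 6.5, 2.9, 3.4, 2.7],
    "Saudi ASR v2": [17.6, 8.1, 8.6, 19.3, 17.0, 41.5],
}

def overall_wer(per_lang):
    """Unweighted (macro) mean over the languages the model actually covers."""
    covered = [w for w in per_lang if w is not None]
    return round(mean(covered), 1), len(covered)

for model, per_lang in WER_BY_LANG.items():
    wer, n = overall_wer(per_lang)
    print(f"{model}: {wer}% over {n} languages")
```

With 100 samples per language, the same numbers also come out of a plain mean over all per-sample WERs. Because Parakeet TDT v3's mean skips Arabic, a like-for-like comparison would drop Arabic from every model before averaging.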
Speed (Real-Time Factor)
Whisper v3 Turbo: RTF 0.021 (48× real time)
Parakeet TDT v3: RTF 0.010 (99× real time)
Cohere Transcribe: RTF 0.038 (26× real time)
Saudi ASR v2: RTF 0.024 (42× real time)
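The real-time factor is processing time divided by audio duration, and the "N× real time" figure is its reciprocal. The sketch below assumes RTF is total wall-clock time over total audio duration; the page does not say how batched inference was timed (model loading, padding, warm-up), and `transcribe` in `measure_rtf` is a hypothetical callable.

```python
import time

def real_time_factor(processing_seconds, audio_seconds):
    """RTF = wall-clock processing time / duration of the audio (lower is faster)."""
    return processing_seconds / audio_seconds

def speedup(rtf):
    """Seconds of audio transcribed per second of compute ('N x real time')."""
    return 1.0 / rtf

def measure_rtf(transcribe, audio, audio_seconds):
    """Time a single call of a hypothetical transcribe(audio) callable."""
    start = time.perf_counter()
    transcribe(audio)
    return real_time_factor(time.perf_counter() - start, audio_seconds)

for name, rtf in [("Whisper v3 Turbo", 0.021), ("Parakeet TDT v3", 0.010),
                  ("Cohere Transcribe", 0.038), ("Saudi ASR v2", 0.024)]:
    print(f"{name}: RTF {rtf:.3f} -> {speedup(rtf):.0f}x real time")
```

The reciprocals reproduce 48×, 26× and 42×. The Parakeet card's 99× next to a displayed RTF of 0.010 suggests the unrounded RTF is slightly above 0.010 (1/99 ≈ 0.0101, which rounds to 0.010).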
Confidence Analysis (Log-Probability)
Each model reports confidence differently: an average per-token log-probability for Whisper and Cohere, and a per-word score for Parakeet.
Scales are therefore not directly comparable across models; the useful signal is the correlation between confidence and accuracy within each model.
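To see why the scales differ, take one hypothetical transcript whose total log-probability is spread over 8 tokens but only 4 words. Dividing by tokens (the per-token average) and dividing by words (the `hypothesis.score / n_words` convention listed for Parakeet in the methodology) give different numbers for the same output. The helper names are illustrative, and the sketch treats the total as a plain sum without modelling how each toolkit actually accumulates its score.

```python
def avg_token_logprob(token_logprobs):
    """Mean log-probability per token (per-token average, as for Whisper/Cohere)."""
    return sum(token_logprobs) / len(token_logprobs)

def per_word_score(total_score, n_words):
    """Total hypothesis score divided by word count (per-word, as for Parakeet)."""
    return total_score / n_words

# Hypothetical transcript: 4 words tokenized into 8 tokens, each with log-prob -0.25.
tokens = [-0.25] * 8
total = sum(tokens)                       # -2.0
print(avg_token_logprob(tokens))          # -0.25
print(per_word_score(total, n_words=4))   # -0.5
```

Either convention is fine for ranking one model's outputs against each other, which is all the calibration analysis needs.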
Log-Prob Distribution (normalized per model)
Whisper v3 Turbo: range -1.0 to 0.0, median -0.16
Parakeet TDT v3: range -51.9 to -3.3, median -17.40
Cohere Transcribe: range -40.7 to -19.3, median -27.80
Saudi ASR v2: range -169.5 to 0.0, median -65.69
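A minimal sketch of the per-model summary, assuming "normalized per model" means each model's scores are rescaled to that model's own [min, max] range for display. Min-max scaling is monotonic, so it leaves each model's ranking of its own outputs unchanged. The input values below are hypothetical.

```python
from statistics import median

def summarize(logprobs):
    """(min, median, max) of one model's per-sample log-probabilities."""
    return min(logprobs), median(logprobs), max(logprobs)

def minmax_normalize(logprobs):
    """Rescale one model's scores to [0, 1] using that model's own min and max."""
    lo, hi = min(logprobs), max(logprobs)
    if hi == lo:                      # degenerate case: all scores identical
        return [0.5] * len(logprobs)
    return [(x - lo) / (hi - lo) for x in logprobs]

scores = [-4.0, -2.0, 0.0]            # hypothetical per-sample log-probs
print(summarize(scores))              # (-4.0, -2.0, 0.0)
print(minmax_normalize(scores))       # [0.0, 0.5, 1.0]
```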
Confidence vs WER (scatter)
Lower confidence (left) should predict higher WER (top). Good calibration shows up as a clear trend of points rising toward the upper left.
[Scatter plots, one panel per model (Whisper v3 Turbo, Parakeet TDT v3, Cohere Transcribe, Saudi ASR v2). x-axis: log-prob (confidence increases to the right); y-axis: WER, 0% to 100% (0% to 80% for Cohere Transcribe).]
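One way to put a number on the trend in each scatter plot is a Spearman rank correlation between per-sample confidence and WER; values near -1 mean higher confidence reliably goes with lower WER. Working on ranks sidesteps the incomparable score scales, and average ranks handle the many ties at 0% WER. This is a pure-Python sketch under those assumptions; the page does not say which statistic, if any, it computes.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # mean of positions i+1 .. j+1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sx * sy)

def spearman(confidence, wer):
    """Spearman's rho: Pearson correlation of the rank-transformed variables."""
    return pearson(average_ranks(confidence), average_ranks(wer))

# Hypothetical per-sample values for one model: higher log-prob, lower WER.
conf = [-3.0, -2.0, -1.0, -0.5]
wer = [40.0, 20.0, 10.0, 0.0]
print(round(spearman(conf, wer), 6))  # -1.0
```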
Calibration: WER by Confidence Quintile
Samples split into 5 equal bins by model confidence. Well-calibrated models show monotonically decreasing WER from Q1 (least confident) to Q5 (most confident).
Each cell: average WER / share of samples in the bin with 0% WER.
Quintile           | Whisper v3 Turbo | Parakeet TDT v3 | Cohere Transcribe | Saudi ASR v2
Q1 (lowest conf.)  | 15.3% / 42%      | 3.1% / 83%      | 1.6% / 88%        | 12.9% / 44%
Q2                 | 7.5% / 55%       | 4.6% / 66%      | 4.1% / 81%        | 16.4% / 36%
Q3                 | 7.6% / 57%       | 4.5% / 72%      | 4.1% / 75%        | 19.2% / 33%
Q4                 | 6.1% / 64%       | 6.3% / 60%      | 3.6% / 73%        | 21.6% / 30%
Q5 (highest conf.) | 3.8% / 74%       | 9.3% / 55%      | 9.8% / 44%        | 23.4% / 32%
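The quintile table can be reproduced in a few lines, assuming samples are sorted by each model's confidence score, cut into five near-equal bins, and "0% WER rate" is the share of samples in a bin with no word errors. `quintile_table` is an illustrative helper, and the ten samples in the example are hypothetical.

```python
def quintile_table(confidence, wer, n_bins=5):
    """Per-bin (average WER, share of samples with 0% WER), least to most confident."""
    if len(confidence) < n_bins:
        raise ValueError("need at least one sample per bin")
    order = sorted(range(len(confidence)), key=lambda i: confidence[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]  # near-equal slices
        bin_wer = [wer[i] for i in idx]
        avg = sum(bin_wer) / len(bin_wer)
        zero_rate = sum(1 for w in bin_wer if w == 0) / len(bin_wer)
        rows.append((avg, zero_rate))
    return rows

# Hypothetical: 10 samples, confidence rising left to right.
conf = [-0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0]
wer = [50, 40, 30, 0, 20, 0, 10, 0, 0, 0]
for q, (avg, zero) in enumerate(quintile_table(conf, wer), start=1):
    print(f"Q{q}: avg WER {avg:.1f}% | 0% WER rate {zero:.0%}")
```

With 600 samples each bin holds 120 samples (100 for Parakeet TDT v3's 500).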
Sample Comparison (highest model divergence)
en_003 (English)
REF: Do you mean it?
Whisper v3 Turbo: Do you mean it? (logp: -0.51) • WER 0.0%
Parakeet TDT v3: Do you mean it? (logp: -4.10) • WER 0.0%
Cohere Transcribe: W. Win. It. (logp: -22.26) • WER 75.0%
Saudi ASR v2: دول مين (logp: -36.04) • WER 100.0%
fr_015 (French)
REF: trois rue Grand Charriera, quarante-huit, zéro zéro zéro à Badaroux
Whisper v3 Turbo: 3 rues Grand Charrière, 48 000 à Badaoui. (logp: -0.38) • WER 81.8%
Parakeet TDT v3: trois rue Grand Charrière, quarante-huit, zéro zéro zéro à Badaou (logp: -27.28) • WER 18.2%
Cohere Transcribe: trois rue Grand Charriere, quarante-huit, zéro zéro zéro à Badarou (logp: -26.29) • WER 18.2%
Saudi ASR v2: trois rue Grand Charriera, quarante- huit zéro zéro zéro à Badaou (logp: -102.18) • WER 9.1%
ar_012 (Arabic)
REF: عير رأى أسد العرين فهاله
Whisper v3 Turbo: عير رأى أسد العارين فها له (logp: -0.18) • WER 60.0%
Parakeet TDT v3: N/A
Cohere Transcribe: عِير رَأَى أَسَدُ الْعَرِينِ فَهَالَهُ (logp: -26.26) • WER 0.0%
Saudi ASR v2: عير رأى أسد العرين فهاله (logp: -66.06) • WER 0.0%
en_040 (English)
REF: Geils began playing jazz trumpet but eventually switched to blues guitar.
Whisper v3 Turbo: Yes, we got playing jazz drum, but everyone should have switched to blues guitar. (logp: -0.73) • WER 63.6%
Parakeet TDT v3: They began playing jazz trompet but eventually switched to blues guitar. (logp: -8.48) • WER 18.2%
Cohere Transcribe: Zell began playing jazz trumpet, but eventually switched to blues guitar. (logp: -24.50) • WER 9.1%
Saudi ASR v2: Jale began playing jazz trumpet but eventually switched to blues guitar. (logp: -82.74) • WER 27.3%
en_061 (English)
REF: To increase his popularity in public-opinion polls, the politician started a campaign.
Whisper v3 Turbo: To increase popularity in front of the union halls, the politicians started a campaign. (logp: -0.46) • WER 53.8%
Parakeet TDT v3: To increase his popularity in public opinion polls, the politician started a campaign. (logp: -9.24) • WER 0.0%
Cohere Transcribe: To increase his popularity in public opinion polls, the politician started a campaign. (logp: -23.98) • WER 0.0%
Saudi ASR v2: To increase his popularity in public opinion holes the politicians started a campaign. (logp: -66.29) • WER 30.8%
fr_064 (French)
REF: cinq rue du Pousadou, zéro neuf, trois cent cinquante à Campagne-sur-Arize
Whisper v3 Turbo: 5 rues du Pouzadou, 09 350 à Campagne sur Arise. (logp: -0.37) • WER 69.2%
Parakeet TDT v3: cinq rue du Pouzadou, zéro neuf, trois cent cinquante à Campagne-sur-Arrize (logp: -34.11) • WER 15.4%
Cohere Transcribe: cinq rue du Pouzadou, zéro neuf, trois cent cinquante à Campagne-sur-Arise (logp: -27.77) • WER 15.4%
Saudi ASR v2: cinq rue du Pousadou, zéro neuf trois cent cinquante à Campagne- sur- Aryse (logp: -92.40) • WER 7.7%
ar_037 (Arabic)
REF: لسببٍ ما ، المايكروفون لم يعمل سابقًا.
Whisper v3 Turbo: لسبب الله المايكروفون لم يعلم سابعا (logp: -0.15) • WER 50.0%
Parakeet TDT v3: N/A
Cohere Transcribe: لسبب ما المايكروفون لم يعمل سابقاً. (logp: -21.69) • WER 0.0%
Saudi ASR v2: لسبب ما الميكروفون لم يعلم سابقا (logp: -61.78) • WER 33.3%
fr_094 (French)
REF: Deux larges balcons arrondis, en ferronnerie, surplombent la nef.
Whisper v3 Turbo: Deux larges balcons arrondis en ferronnerie sur plombes à nez. (logp: -0.29) • WER 44.4%
Parakeet TDT v3: Deux larges balcons arrondis en ferronnerie surplombent la nef. (logp: -22.49) • WER 0.0%
Cohere Transcribe: Deux larges balcons arrondis, en ferronnerie, surplombent la nef. (logp: -28.25) • WER 0.0%
Saudi ASR v2: Deux larges balcons arrondis en ferronnerie surplombent la nef (logp: -80.45) • WER 0.0%
ar_065 (Arabic)
REF: ولبيوتهم أبوابا وسررا عليها يتكئون
Whisper v3 Turbo: وليبيوتهم أبوابا وسرورا عليها يتكئون (logp: -0.16) • WER 40.0%
Parakeet TDT v3: N/A
Cohere Transcribe: وَلِبُيُوتِهِمْ أَبْوَابًا وَسُرُرًا عَلَيْهَا يَتَّكِئُونَ (logp: -24.89) • WER 0.0%
Saudi ASR v2: وَلِبُيُوتِهِمْ أَبْوَابًا وَسُرُرًا عَلَيْهَا يَتَّكِئُونَ (logp: -169.53) • WER 0.0%
it_006 (Italian)
REF: Fu amico e compagno di classe di Albert Einstein.
Whisper v3 Turbo: suo amico è il compagno di classe di Albert Einstein (logp: -0.23) • WER 33.3%
Parakeet TDT v3: Fu amico e compagno di classe di Albert Eisen. (logp: -6.99) • WER 11.1%
Cohere Transcribe: Fu amico e compagno di classe di Albert Einstein. (logp: -34.38) • WER 0.0%
Saudi ASR v2: Fue amigo ex compaño de clase de advertisement. (logp: -53.97) • WER 111.1%
es_031 (Spanish)
REF: Esto llevó al declive de la literatura en lenguas vernáculas.
Whisper v3 Turbo: Esto llegó al declínio de la literatura en lenguas vernáculares. (logp: -0.29) • WER 30.0%
Parakeet TDT v3: Esto llevó al declive de la literatura en lenguas vernáculas. (logp: -19.15) • WER 0.0%
Cohere Transcribe: Esto llevó al declive de la literatura en lenguas vernáculas. (logp: -27.84) • WER 0.0%
Saudi ASR v2: Esto llevó al declive de la literatura en lenguas vernáculas. (logp: -67.96) • WER 20.0%
es_021 (Spanish)
REF: La pareja dividió uniformemente esa cantidad y sus bienes personales, terminando su relación comercial.
Whisper v3 Turbo: La pareja dividió uniformemente su cantidad y sus bienes personales, terminando su relación con el chef. (logp: -0.21) • WER 28.6%
Parakeet TDT v3: La pareja dividió uniformemente esa cantidad y sus bienes personales, terminando su relación comercial. (logp: -19.58) • WER 0.0%
Cohere Transcribe: La pareja dividió uniformemente esa cantidad y sus bienes personales, terminando su relación comercial. (logp: -28.16) • WER 0.0%
Saudi ASR v2: La pareja dividió uniformemente su cantidad y sus bienes personales terminando su relación comercial. (logp: -66.53) • WER 21.4%
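The page does not define "highest model divergence". One plausible reading, sketched here purely as an assumption, is the spread between the worst and best WER among the models that produced an output for a sample. The two entries below copy WERs from the comparisons above, with `None` for Parakeet TDT v3 on Arabic; `divergence` and `most_divergent` are hypothetical helpers.

```python
def divergence(wers):
    """Spread between worst and best WER among models with a result (None = no result)."""
    scored = [w for w in wers.values() if w is not None]
    return max(scored) - min(scored)

def most_divergent(samples, k):
    """IDs of the k samples with the largest WER spread across models."""
    return sorted(samples, key=lambda sid: divergence(samples[sid]), reverse=True)[:k]

samples = {
    "en_003": {"Whisper v3 Turbo": 0.0, "Parakeet TDT v3": 0.0,
               "Cohere Transcribe": 75.0, "Saudi ASR v2": 100.0},
    "ar_012": {"Whisper v3 Turbo": 60.0, "Parakeet TDT v3": None,
               "Cohere Transcribe": 0.0, "Saudi ASR v2": 0.0},
}
print(most_divergent(samples, k=1))  # ['en_003']
```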
Methodology: 600 samples from Common Voice 17 (test split), converted to 8 kHz μ-law mono (PSTN format).
WER computed with NFKC + lowercase + punctuation-strip normalization; Arabic: diacritics stripped + alef/ya/teh-marbuta normalized.
All models receive the same audio and language hint. Inference on NVIDIA A100-SXM4 40GB.
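A sketch of the scoring step, assuming standard word-level Levenshtein distance (substitutions + deletions + insertions, divided by the reference word count) on normalized text. Punctuation is replaced by a space rather than deleted, which is what reproduces fr_015's 11-word reference ("quarante-huit" counts as two words). The Arabic character sets are assumptions: tashkeel U+064B to U+0652 plus superscript alef U+0670 are stripped, hamza-carrying alef forms fold to bare alef, alef maqsura to ya, and teh marbuta to heh. Because insertions count, WER can exceed 100%.

```python
import re
import unicodedata

# Assumed Arabic normalization: strip tashkeel (U+064B..U+0652) and superscript
# alef (U+0670); fold alef variants -> alef, alef maqsura -> ya, teh marbuta -> heh.
_AR_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_AR_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627",
                          "\u0649": "\u064A", "\u0629": "\u0647"})

def normalize(text, arabic=False):
    """NFKC + lowercase (+ Arabic folding), punctuation -> space, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    if arabic:
        text = _AR_DIACRITICS.sub("", text).translate(_AR_FOLD)
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return text.split()

def word_errors(ref, hyp):
    """Word-level Levenshtein distance (S + D + I) using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,               # deletion of r
                         cur[j - 1] + 1,            # insertion of h
                         prev[j - 1] + (r != h))    # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis, arabic=False):
    """WER in percent; exceeds 100 when errors outnumber reference words."""
    ref, hyp = normalize(reference, arabic), normalize(hypothesis, arabic)
    return 100.0 * word_errors(ref, hyp) / len(ref)

print(round(wer("trois rue Grand Charriera, quarante-huit, zéro zéro zéro à Badaroux",
                "trois rue Grand Charrière, quarante-huit, zéro zéro zéro à Badaou"), 1))  # 18.2
print(wer("Do you mean it?", "W. Win. It."))  # 75.0
```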
Models:
• Whisper large-v3-turbo (faster-whisper, float16, avg_logprob per segment)
• Parakeet TDT 0.6B v3 (NeMo, hypothesis.score / n_words)
• Cohere Transcribe 03-2026 (transformers, bfloat16, avg token log_softmax)
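The telephony condition compresses audio with μ-law companding. The sketch below implements the continuous μ-law curve with μ = 255 plus an 8-bit quantizer; it illustrates the idea and is not bit-exact G.711, which uses a segmented piecewise-linear approximation of this curve. Resampling to 8 kHz mono is a separate step (one common route is `ffmpeg -i in.wav -ar 8000 -ac 1 -c:a pcm_mulaw out.wav`); the page does not say which tool the benchmark used.

```python
import math

MU = 255  # mu value used by G.711 mu-law (North American / Japanese PSTN)

def mu_law_compress(x, mu=MU):
    """Map a linear sample in [-1, 1] onto the mu-law curve, also in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def to_code(y):
    """Quantize a companded value in [-1, 1] to one of 256 levels (0..255)."""
    return round((y + 1) / 2 * 255)

def from_code(code):
    """Map an 8-bit level back to [-1, 1]."""
    return code / 255 * 2 - 1

# After 8-bit quantization, quiet and loud samples both keep a relative
# error of a few percent, which is the point of companding.
for x in (0.01, 0.1, 0.5, 0.9):
    x_hat = mu_law_expand(from_code(to_code(mu_law_compress(x))))
    print(f"x={x:<5} decoded={x_hat:.4f} rel.err={abs(x_hat - x) / x:.2%}")
```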