Accuracy benchmark

How accurate is FancyCaptions on Dutch & Flemish?

On our 2026-06-11 benchmark of 81 Dutch clips, our default engine (Scribe v2) reached a 25.3% word error rate (WER), ahead of Whisper's 28.7% on the same set. Lower WER is better. That's strong Dutch and Flemish recognition relative to the common Whisper baseline, and we publish the dataset so you can check it.

25.3% WER: Scribe v2 on 81 Dutch clips (our 2026-06-11 benchmark)
−3.4 pts: More accurate than Whisper (28.7%) on the identical set
0 divergence: Render parity vs the leading paid tool, across 1,647 frames

In short: we benchmarked six speech-to-text configurations (five engines, one of them also run with a keyterm glossary) on 81 real Dutch clips on 2026-06-11. Scribe v2, our default for Dutch and Flemish, reached a 25.3% word error rate, ahead of Whisper's 28.7%. We report the real, dated number rather than a round 99%, and the dataset and method are below so you can audit it.

By the FancyCaptions team — the people behind a pixel-parity caption engine. Last updated June 25, 2026.

Which engine is most accurate on Dutch?

Scribe v2 and Speechmatics tie for the lead at 25.3% WER, with a keyterm glossary nudging Scribe v2 to 25.0%. Whisper large-v2 — the previous default — was the worst of the six at 28.7%. Lower word error rate is better. Here is the full ranking on the 81 Dutch clips.

Engine / config | WER (lower is better) | Notes
Scribe v2 + keyterm glossary [Ours] | 25.0% | Our best config: customer glossary biasing
Scribe v2 (our default for NL/Flemish) [Ours] | 25.3% | Recommended default, temperature 0
Speechmatics (enhanced) | 25.3% | Co-leader; configured alternative
Scribe v1 | 26.2% | ~1 pt behind v2
AssemblyAI Universal-2 | 27.0% |
Whisper large-v2 | 28.7% | The previous prod default; worst here

Source: our ASR parameter sweep, 81 Dutch clips, measured 2026-06-11. Micro-averaged WER, lenient scoring, no LLM correction (raw engine). Whisper was our previous Dutch default; the study moved us to Scribe v2.

What's the dataset and method?

We tested 84 real short-form speech clips — 81 genuinely Dutch, plus 3 with translated, non-matching captions that we excluded from the Dutch ranking. The set was split 42 dev / 42 test with an interleaved seed so no engine could be tuned to the scored half. The reference transcript was a mainstream auto-caption tool's output, scored leniently with micro-averaged word error rate, and no LLM cleanup was applied — this measures the raw speech-to-text engine, not a post-processor.

  • Dataset: 84 clips (81 Dutch + 3 translation-mismatch), real short-form speech.
  • Split: 42 dev / 42 test, interleaved seed — held-out test prevents overfitting.
  • Metric: micro-averaged word error rate (WER); lower is better.
  • Reference: a mainstream auto-caption tool's output, lenient scoring; no LLM correction (raw engine).
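
To make "micro-averaged" concrete, here is a minimal Python sketch. It assumes each clip has already been reduced to a pair of (error count, reference word count); the function names and the sample numbers are illustrative and are not taken from our harness or our data.

```python
from typing import List, Tuple

# Each clip is (errors, reference_words). These names are hypothetical.
Clip = Tuple[int, int]


def micro_wer(clips: List[Clip]) -> float:
    """Pool errors and words over all clips, then divide once."""
    total_errors = sum(errors for errors, _ in clips)
    total_words = sum(words for _, words in clips)
    return total_errors / total_words


def macro_wer(clips: List[Clip]) -> float:
    """Average the per-clip WERs, so every clip counts equally."""
    return sum(errors / words for errors, words in clips) / len(clips)


# Made-up example: one short, badly transcribed clip and one long, clean clip.
sample = [(5, 10), (10, 90)]
print(f"micro: {micro_wer(sample):.3f}")
print(f"macro: {macro_wer(sample):.3f}")
```

With micro-averaging, a very short clip cannot swing the headline figure the way it can under a per-clip average, which is one common reason to prefer it on short-form audio.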

Is the result overfit — and what does 25% actually mean?

No, the ranking holds on held-out data. The same engines lead on both the dev and test splits (Scribe and Speechmatics best, Whisper worst), so the engine choice isn't a fluke of one half. Every config scores worse on test than on dev, by about 2.6 to 4.7 points. Because every engine shifts in the same direction, the gap points to a harder test half (split-difficulty variance), not tuning.

Config | Dev WER | Test WER
Scribe v2 + keyterms | 22.6% | 27.3%
Scribe v2 | 22.9% | 27.6%
Speechmatics | 24.4% | 27.0%
Whisper | 27.4% | 30.0%

The ~25% figure also isn't all real error. About 38% of the substitutions are cosmetic (digits versus number-words, dialect spelling, pronoun variants), and the reference is itself machine-plus-human captions, not perfect truth. Accounting for that, the true content error is closer to 21%. We still publish the conservative 25.3% because it's the number you can reproduce against the same reference.

What about English — and why Dutch is the differentiator

We benchmarked Dutch specifically, so we won't quote an English WER we didn't measure here. As a rough industry guide, English and other high-resource languages typically reach around 78–81% word accuracy (roughly 19–22% WER) on real short-form audio, and English usually transcribes more cleanly than Dutch. We'd expect our English to be solid and on par with the field; it isn't where we stand out.

Where we genuinely differ is twofold. First, Dutch and Flemish: most caption tools treat them as just another language and route them to a generic model, where our routing to Scribe v2 measurably wins. Second, render parity: independent of which words come back, our on-screen captions match the leading paid tool frame-for-frame, at zero divergence across 1,647 reference frames. Accurate words and an exact render are separate problems, and we can show our work on both.

Frequently asked questions

How accurate is FancyCaptions on Dutch and Flemish?

On our 2026-06-11 benchmark of 81 Dutch clips, our default engine (Scribe v2) reached a 25.3% word error rate — ahead of Whisper's 28.7% on the same set, and matched only by Speechmatics. Lower WER is better. That's strong Dutch and Flemish recognition relative to the common Whisper baseline; we publish the date and clip count so the figure is auditable rather than a round marketing number.

What is word error rate (WER) and why not just quote 99%?

Word error rate is the share of words the transcript gets wrong — substitutions, deletions and insertions divided by the reference word count — so lower is better. We report WER rather than a flashy 99% accuracy figure because WER is the standard, auditable measure, and a single round percentage hides how a model behaves on accented, fast or noisy speech. We give you the real number with its date and dataset so you can judge it yourself.
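
For readers who want to check the definition by hand, here is a minimal Python sketch of the calculation using a standard word-level edit distance. The lowercasing and whitespace tokenization are assumptions made for the example; the lenient scoring rules used in our benchmark are not reproduced here, and the names `WerCounts` and `count_errors` are illustrative.

```python
from typing import NamedTuple


class WerCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # (S + D + I) / reference word count; assumes a non-empty reference.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def count_errors(reference: str, hypothesis: str) -> WerCounts:
    """Align two transcripts word by word with Levenshtein distance."""
    ref = reference.lower().split()  # assumed normalization, not our scorer's
    hyp = hypothesis.lower().split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    table = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # hypothesis empty: every ref word deleted
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # reference empty: every hyp word inserted
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match costs nothing
                continue
            e, s, d, n = table[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Fewest edits wins; ties are broken by tuple order.
            table[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return WerCounts(subs, dels, ins, len(ref))


counts = count_errors("de kat zit op de mat", "de kat zat op mat")
print(counts, f"WER = {counts.wer:.1%}")
```

In this toy pair, "zit" becomes "zat" (one substitution) and the second "de" is dropped (one deletion), so two errors over six reference words. The substitution/deletion/insertion breakdown is the same one behind figures like the deletion rates quoted elsewhere on this page.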

Is FancyCaptions more accurate than Whisper on Dutch?

Yes, on our benchmark. Scribe v2 scored 25.3% WER versus Whisper large-v2's 28.7% on the same 81 Dutch clips measured 2026-06-11 — a 3.4-point improvement, almost entirely from fewer dropped words (Scribe deletes 8.6% vs Whisper's 11.9%). Switching our Dutch default from Whisper to Scribe v2 was the single biggest free accuracy win in the study.

How accurate is FancyCaptions in English?

We benchmarked Dutch specifically, not English, so we don't publish an English WER from this study. As a rough guide, high-resource languages like English typically land around 78–81% word accuracy (roughly 19–22% WER) on real-world short-form audio, and English usually transcribes more cleanly than Dutch. The honest differentiator is Dutch and Flemish, where most tools struggle, plus the render parity, which is exact.

What was the dataset and method?

We used 84 real short-form speech clips (81 Dutch plus 3 with translated, non-matching captions), split 42 dev / 42 test with an interleaved seed. The reference ("ground truth") was a mainstream auto-caption tool's output, scored leniently with micro-averaged WER, and no LLM correction was applied — this measures the raw ASR engine. The harness caches audio and per-config transcripts so re-runs are cheap and reproducible.
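
One plausible way to build a reproducible, evenly sized dev/test split is sketched below in Python. It is an assumption about what an "interleaved seed" split could look like (seeded shuffle, then alternate assignment), not a copy of our harness; `interleaved_split` and the clip IDs are hypothetical.

```python
import random
from typing import List, Tuple


def interleaved_split(clip_ids: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle with a fixed seed, then deal clips alternately to dev and test."""
    ordered = sorted(clip_ids)            # start from a stable order
    random.Random(seed).shuffle(ordered)  # same seed gives the same shuffle
    return ordered[0::2], ordered[1::2]   # even positions -> dev, odd -> test


# Hypothetical clip IDs; 84 clips give 42 dev and 42 test.
clip_ids = [f"clip_{i:03d}" for i in range(84)]
dev, test = interleaved_split(clip_ids, seed=42)
print(len(dev), len(test))
```

Fixing the seed means anyone re-running the split gets the same two halves, and keeping the halves disjoint is what lets the test half act as held-out data.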

Beyond transcription, how do I know the captions render correctly?

Transcription accuracy is one half; the other is whether the on-screen caption matches what you previewed. FancyCaptions renders captions frame-for-frame matched to the leading paid caption tool, measured at zero divergence across 1,647 reference frames (552 + 1,095 across 37 styles). So the animated style you pick in the editor is exactly what exports — the preview and the export run the same render path.
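
As an illustration of what a frame-level parity check can look like, here is a small Python sketch. It assumes "divergence" means the number of frames whose raw pixel bytes differ from the reference; that definition, the function name, and the tiny sample frames are assumptions for the example, not a description of our measurement pipeline.

```python
from typing import Sequence


def count_divergent_frames(ours: Sequence[bytes], reference: Sequence[bytes]) -> int:
    """Count frames whose raw bytes differ from the reference frame."""
    if len(ours) != len(reference):
        raise ValueError("frame counts differ, so the clips cannot be compared")
    return sum(1 for a, b in zip(ours, reference) if a != b)


# Made-up two-frame example: identical renders give zero divergence.
reference_frames = [b"\x00\x01\x02", b"\x10\x11\x12"]
our_frames = [b"\x00\x01\x02", b"\x10\x11\x12"]
print(count_divergent_frames(our_frames, reference_frames))
```

Strict byte equality is the harshest possible criterion; a real comparison might instead allow a small per-pixel tolerance for encoder noise.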

Don't take our word for it — test your own clip

Upload a Dutch or Flemish clip and read the transcript in seconds — free, no sign-up. The best benchmark is your own footage.