AI Strategy · Intermediate · 7 min read

AI Translation Quality

AI translation quality measurement combines automatic metrics (BLEU, chrF, COMET, BLEURT) with human evaluation (LQA — Language Quality Assessment using MQM or DQF rubrics) and Quality Estimation (QE) models that score translations without a reference. Modern programs use COMET or COMET-Kiwi as the production metric (both correlate far better with human judgment than BLEU), MQM-based LQA for sample auditing, and per-segment QE scores to route content for post-editing. The goal isn't a single quality number — it's a calibrated routing decision: which segments are good enough to publish, which need a light edit, and which need full re-translation. Without quality measurement, every other localization decision (vendor selection, MT engine choice, post-edit budget) is guesswork.

Also known as: MT Quality, Translation Quality Estimation, LQA, Translation Evaluation, MTQE

The Trap

The trap is reporting BLEU scores in 2026. BLEU was state-of-the-art in 2002; it correlates poorly with the quality of modern neural and LLM translation and produces misleading vendor comparisons. Teams that benchmark MT vendors on BLEU routinely pick the weaker vendor and over-invest in post-editing where it isn't needed. The KnowMBA POV: if your translation quality program is built on BLEU, you are flying blind. Modern programs use COMET (or COMET-Kiwi for reference-free scoring), backed by MQM-based human evaluation on a sampled subset, with an LLM-as-judge as a third signal where it has been shown to agree with human raters.

What to Do

Stand up a quality program in four steps. (1) Pick COMET or COMET-Kiwi as the production metric. Build a baseline on your content per locale. (2) Run MQM-based LQA on a quarterly sample (e.g., 200 segments per locale) using qualified linguists scoring against a defined rubric. Track major/minor error categories. (3) Deploy a Quality Estimation (QE) model per locale that scores live segments and routes high-confidence to publish, mid to post-edit, low to human re-translation. (4) Close the loop: every LQA finding feeds back into glossary, TM, prompt, or model selection. Track quality trend per locale per quarter — if it's not improving, your loop is broken.
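
A minimal sketch of step (1), assuming the open-source unbabel-comet package and its WMT checkpoints; the checkpoint names, example segment, and batch settings are illustrative, so verify the current model names before locking in a baseline.

```python
# pip install unbabel-comet   (assumed package name; confirm in your environment)
from comet import download_model, load_from_checkpoint

# Reference-based COMET checkpoint; for reference-free scoring swap in a
# COMET-Kiwi checkpoint (e.g. "Unbabel/wmt22-cometkiwi-da") and drop "ref".
# Checkpoint names are assumptions -- check the Unbabel model hub.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Illustrative segment; in production this is a per-locale sample of your own content.
segments = [
    {"src": "Cancel your subscription at any time.",
     "mt":  "Kündigen Sie Ihr Abonnement jederzeit.",
     "ref": "Sie können Ihr Abonnement jederzeit kündigen."},
]

output = model.predict(segments, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print("per-segment scores:", output.scores)             # one score per segment
print("locale baseline (system score):", output.system_score)
```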

Formula

Per-Segment Routing Decision: compare QE score against calibrated thresholds → {publish | post-edit | re-translate}
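
As a worked sketch of this routing rule (the thresholds are placeholders, not recommendations; calibrate them per locale and content type against sampled human LQA):

```python
def route_segment(qe_score: float,
                  publish_threshold: float = 0.85,
                  post_edit_threshold: float = 0.70) -> str:
    """Map a per-segment QE score to a workflow tier.

    The 0.85 / 0.70 cut-offs are illustrative assumptions; calibrate
    per locale and content type against sampled human LQA.
    """
    if qe_score >= publish_threshold:
        return "publish"
    if qe_score >= post_edit_threshold:
        return "post-edit"
    return "re-translate"

# Example: route a batch of QE-scored segments
scores = {"seg-001": 0.91, "seg-002": 0.76, "seg-003": 0.58}
for seg_id, score in scores.items():
    print(seg_id, "->", route_segment(score))
```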

In Practice

COMET (Crosslingual Optimized Metric for Evaluation of Translation), developed by Unbabel and the WMT community, became the de facto modern MT metric, replacing BLEU in serious enterprise quality programs. Unbabel built a translation business specifically around the QE+post-edit workflow, claiming higher quality at lower cost than traditional translation agencies. ModelFront ships QE as a service, integrating with TMS platforms to route segments by predicted quality. Lokalise and Smartling embed QE directly in their workflows. The pattern: enterprise localization has moved decisively to COMET + MQM + QE; vendors and platforms still using BLEU as the headline metric are signaling outdated practice.

Pro Tips

  • 01

    MQM (Multidimensional Quality Metrics) is the modern human-evaluation framework — error categories include accuracy (mistranslation, omission), fluency (grammar, spelling), terminology, style, locale convention, and audience appropriateness. Major errors are weighted more heavily than minor ones (see the scoring sketch after these tips). MQM is significantly more diagnostic than the older 'good/bad/ugly' subjective scales.

  • 02

    Quality Estimation (QE) models like COMET-Kiwi predict translation quality without needing a reference translation, which means you can score every production segment in real time. This is what enables tier routing at scale. Teams that adopt QE typically reduce post-edit volume by 20-40% by skipping segments that don't need editing.

  • 03

    LLM-as-judge can supplement (not replace) human MQM evaluation. Claude, GPT-4, and Gemini score translations reasonably well when prompted with the MQM rubric, the source, and the translation. The pattern: LLM-judge for breadth (every segment), human MQM for depth (sampled audit). Use both — they catch different errors. A prompt sketch follows after these tips.
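
A minimal sketch of the MQM arithmetic referenced in tip 01: weighted error counts normalized per 1,000 words. The error log, word count, and the 5:1 major-to-minor weighting are illustrative assumptions; use whatever weights your rubric defines.

```python
# Hypothetical MQM error log for one LQA sample; categories follow the MQM
# dimensions above, but the counts and word total are made up for illustration.
errors = {
    "accuracy/mistranslation": {"major": 2, "minor": 5},
    "fluency/grammar":         {"major": 0, "minor": 7},
    "terminology":             {"major": 1, "minor": 3},
    "style":                   {"major": 0, "minor": 4},
}
words_evaluated = 4200

# Assumed 5:1 weighting of major vs. minor errors; substitute your rubric's weights.
MAJOR_WEIGHT, MINOR_WEIGHT = 5, 1

penalty = sum(MAJOR_WEIGHT * c["major"] + MINOR_WEIGHT * c["minor"]
              for c in errors.values())
major_count = sum(c["major"] for c in errors.values())

print(f"Weighted penalty per 1,000 words: {1000 * penalty / words_evaluated:.2f}")
print(f"Major errors per 1,000 words:     {1000 * major_count / words_evaluated:.2f}")
```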
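
For tip 03, a hedged sketch of an LLM-as-judge harness: the prompt packs the MQM rubric, the source, and the candidate translation, and asks for structured error annotations. The rubric wording and JSON shape are illustrative, and call_llm is a placeholder for whichever provider SDK you use; this is not any vendor's documented API.

```python
import json

MQM_JUDGE_PROMPT = """You are a translation quality evaluator using the MQM framework.
Score the candidate translation against the source.
Error categories: accuracy (mistranslation, omission), fluency (grammar, spelling),
terminology, style, locale convention. Severity: major or minor.

Source ({src_lang}): {source}
Candidate ({tgt_lang}): {translation}

Respond with JSON only: {{"errors": [{{"category": "...", "severity": "...", "note": "..."}}]}}
"""

def judge_segment(source: str, translation: str, src_lang: str, tgt_lang: str,
                  call_llm) -> list[dict]:
    """Score one segment with an LLM judge.

    call_llm is a placeholder: any function that takes a prompt string and
    returns the model's text response (Claude, GPT-4, Gemini, etc.).
    """
    prompt = MQM_JUDGE_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                     source=source, translation=translation)
    raw = call_llm(prompt)
    # Feed these annotations into the same per-1,000-word MQM arithmetic as above.
    return json.loads(raw)["errors"]
```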

Myth vs Reality

Myth

BLEU is fine for vendor comparison

Reality

BLEU correlates so poorly with modern MT quality that two systems with identical BLEU can have very different human-perceived quality. WMT has used COMET as the primary metric for several years for exactly this reason. Continuing to report BLEU in 2026 is a signal the team isn't keeping up.

Myth

A single quality score per locale is sufficient

Reality

Quality varies by content type within a locale (UI vs marketing vs legal), by domain (medical vs technical vs general), and by direction (English→German is not symmetric to German→English). Per-locale, per-content-type tracking is the realistic minimum for a serious program.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge — answer the challenge or try the live scenario.


Knowledge Check

Your localization team is choosing between two MT vendors. Vendor A scores 42 BLEU and Vendor B scores 39 BLEU on your test set. Vendor B scores 0.82 COMET; Vendor A scores 0.75 COMET. Human LQA shows Vendor B has 30% fewer major errors per 1,000 words. Which vendor and why?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

MQM Major Error Rate per 1,000 Words (Production MT + Post-Edit)

MQM-based human LQA on production-tier translations

Excellent: < 1.0

Good: 1.0-3.0

Acceptable for T2 Content: 3.0-7.0

Below Standard: > 7.0

Source: WMT and MQM Council practitioner benchmarks

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

Unbabel + COMET (2017-2026) · Success

Unbabel co-developed COMET with the WMT research community and built a translation business specifically around the QE-routing + human-in-the-loop workflow. By scoring every segment with COMET-Kiwi, Unbabel routes only the genuinely uncertain segments to human editors, claiming significant cost reduction vs traditional agencies while improving quality through systematic measurement. COMET became the de facto modern MT metric; Unbabel's adoption pattern is now common across enterprise localization platforms.

Co-Developed Metric: COMET (now industry standard)

Workflow: QE-routed human-in-the-loop

Industry Impact: BLEU largely displaced by COMET in enterprise

Better metrics enable better routing; better routing enables both cost reduction and quality improvement. Investing in measurement is the highest-leverage move in a localization quality program.


ModelFront (2018-2026) · Success

ModelFront built a Quality Estimation API specifically for translation. By integrating with TMS platforms (memoQ, Phrase, Trados), they let enterprise localization teams route segments to publish, post-edit, or re-translate based on predicted quality without requiring a reference translation. Customer reports cite 20-40% post-edit reduction with no quality loss and faster time-to-publish on auto-routed segments. The product is a focused example of how QE changes the localization economics.

Reported Post-Edit Reduction: 20-40%

Integrations: memoQ, Phrase, Trados, custom

Approach: Reference-free QE per segment

Quality Estimation turns localization from a flat-cost activity into a tiered, intelligent workflow. The platform pays for itself by removing post-edit work that wasn't needed in the first place.



Beyond the concept

Turn AI Translation Quality into a live operating decision.

Use AI Translation Quality as the framing layer, then move into a diagnostic or advisory engagement if it maps directly to a current business bottleneck.

Typical response time: 24h · No retainer required