BLEU Score

Krishna Pullakandam
3 min read · May 13, 2022

What it stands for: BiLingual Evaluation Understudy

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated (MT) text. The BLEU score is a number between zero and one that measures how similar the machine-translated text is to a set of high-quality reference translations.

Understudy: in the theatre world, an understudy is someone who learns the role of a senior actor so that they can take over the part someday.

A value of 0 means that the machine-translated output has no overlap with the reference translations (low-quality translation), while a value of 1 means there is perfect overlap with the references (high-quality translation).

It has been shown that BLEU scores correlate well with human judgments of translation quality. Note that even human translators do not achieve a perfect score of 1.0.

Evaluating MT with an example:

Let’s say we have a French sentence and two acceptable English reference translations, Reference 1 and Reference 2. The BLEU score can be used to evaluate how well a certain machine translation (MT) system works.

The intuition here is that as long as the machine-generated translation is pretty close to any of the references provided by humans, it will get a high BLEU score.

French: Le chat est sur le tapis.

Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT Output: the the the the the the the.

Precision: 7 (every one of these words appears in at least one of the references) / 7 (there are seven words in the output) = 100%. Plain precision is very high even though the output is clearly a terrible translation.

Modified precision: 2 / 7 ≈ 28.57%. Each word’s count is clipped to the maximum number of times it appears in any single reference: `the` appears twice in Reference 1 and once in Reference 2, so its clipped count is 2, and the output has seven words.
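To make the clipping concrete, here is a minimal Python sketch that reproduces the 2/7 above (the function name and the naive period-stripping tokenizer are illustrative choices, not a standard API):

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clipped unigram precision: each candidate word counts at most as many
    times as it appears in any single reference."""
    tokenize = lambda s: s.lower().replace(".", "").split()
    cand_counts = Counter(tokenize(candidate))
    ref_counts = [Counter(tokenize(ref)) for ref in references]
    clipped = sum(min(count, max(rc[word] for rc in ref_counts))
                  for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_unigram_precision("the the the the the the the.", references))
# 0.2857... (= 2/7)
```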

BLEU score on bigrams:

Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT Output: The cat the cat on the mat.

Bigrams in the MT output, with raw counts and clipped counts:

the cat: 2, clipped to 1 (appears once in Reference 1)
cat the: 1, clipped to 0 (appears in neither reference)
cat on: 1, clipped to 1 (appears in Reference 2)
on the: 1, clipped to 1 (appears in both references)
the mat: 1, clipped to 1 (appears in both references)

Modified bigram precision = 4 (sum of clipped counts) / 6 (total bigrams in the output) ≈ 66.67%
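The same clipping idea extends to any n-gram size. Here is a sketch that generalizes the function above (again, the helper names and tokenization are my own) and reproduces the 4/6:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as many
    times as it appears in any single reference."""
    tokenize = lambda s: s.lower().replace(".", "").split()
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    ref_counts = [Counter(ngrams(tokenize(ref), n)) for ref in references]
    clipped = sum(min(count, max(rc[gram] for rc in ref_counts))
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("The cat the cat on the mat.", references, 2))
# 0.6666... (= 4/6)
```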

We can generalize the approach we followed with unigrams and bigrams above and write a formula for the modified precision of a general n-gram:
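Written out (this is the standard formulation from the original BLEU paper, Papineni et al., 2002), the modified n-gram precision is:

```latex
p_n = \frac{\sum_{g \,\in\, \text{n-grams(output)}} \mathrm{Count}_{\mathrm{clip}}(g)}
           {\sum_{g \,\in\, \text{n-grams(output)}} \mathrm{Count}(g)}
```

where Count_clip(g) clips the count of each n-gram g to the maximum number of times it occurs in any single reference.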

Note: if the MT output is exactly the same as one of the reference translations, all the scores p1, p2, … equal 1.0.

Notation: pn = the modified precision score on n-grams only.

Let’s say we have the individual scores p1, p2, p3, p4 (one for each n-gram size). How do we combine them into a single metric?

Combined BLEU score:
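The standard way to combine them, with uniform weights over n = 1 to 4, is:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right)
```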

If we look at the function above, it is the exponential of the arithmetic mean of the log scores, which is just the geometric mean of the individual p-scores. Since the exponential function is monotonically increasing, better individual p-scores always yield a better combined BLEU score.

BP stands for Brevity Penalty.

Why is the penalty required? It turns out that if the MT output is very short, precision automatically increases, because there is a high chance that each word in the MT output is present in at least one of the reference translations. To discourage overly short translations, BP penalizes systems whose outputs are shorter than the references.
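In the original paper, BP compares the output length c with the reference length r (the reference closest in length to the output):

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Putting the pieces together, here is a sketch that builds on the modified_precision function above (my own helper, not a library API); note that production implementations such as NLTK's sentence_bleu also smooth zero n-gram counts:

```python
import math

def bleu(candidate, references, max_n=4):
    """Combined BLEU: brevity penalty times the geometric mean of p1..p4.
    Assumes every p_n > 0; real implementations smooth zero counts."""
    tokenize = lambda s: s.lower().replace(".", "").split()
    c = len(tokenize(candidate))
    # Reference length r: length of the reference closest in length to the output.
    r = min((abs(len(tokenize(ref)) - c), len(tokenize(ref)))
            for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    mean_log_p = sum(math.log(modified_precision(candidate, references, n))
                     for n in range(1, max_n + 1)) / max_n
    return bp * math.exp(mean_log_p)

references = ["The cat is on the mat.", "There is a cat on the mat."]
print(bleu("The cat the cat on the mat.", references))  # ≈ 0.467
```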

