Attempt to stop people from publishing non-comparable BLEU scores, as discussed in statmt meeting

2024-09-11 11:25:40 +03:00 · 2017-10-19 22:57:36 +01:00
commit 545eee7e75
parent eced95d694
1 changed file with 3 additions and 0 deletions
--- a/scripts/generic/multi-bleu.perl
+++ b/scripts/generic/multi-bleu.perl
@@ -168,6 +168,9 @@ printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_l
    $length_translation,
    $length_reference;

+
+print STDERR "Do not publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n";
+
 sub my_log {
  return -9999999999 unless $_[0];
  return log($_[0]);