Merged PR 11434: Fixes empty line handling with factored segmenter

Fixes empty line handling with the factored segmenter (FS). In my previous PR, which fixed general empty line handling, I misunderstood the relation between WordIndex and factors and did an incorrect inverse look-up of the word index of EOS. This should now be fixed for FS; there should be no change when FS is not used.
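To illustrate why the EOS word index cannot be used directly when factors are predicted one group at a time, here is a minimal sketch. `ToyFactoredVocab`, its packing scheme, and `forcedEosIndex` are hypothetical names invented for this example; they are not Marian's actual FactoredVocab layout, only an assumption that a word index combines a lemma index with further factor indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using WordIndex = std::uint32_t;

// Hypothetical toy vocabulary (an assumption for illustration only): every word is a
// (lemma, capitalization factor) pair packed into a single word index.
struct ToyFactoredVocab {
  std::size_t numLemmas;      // size of factor group 0 (lemmas)
  std::size_t numCapFactors;  // size of factor group 1
  std::size_t eosLemma;       // lemma index of EOS within group 0

  // Pack the two factor indices into one word index.
  WordIndex factors2word(std::size_t lemma, std::size_t cap) const {
    assert(lemma < numLemmas && cap < numCapFactors);
    return static_cast<WordIndex>(lemma * numCapFactors + cap);
  }

  // Unpack a word index into its factor indices; factors[0] is the lemma.
  void word2factors(WordIndex word, std::vector<std::size_t>& factors) const {
    factors.assign({word / numCapFactors, word % numCapFactors});
  }

  WordIndex getEosId() const { return factors2word(eosLemma, 0); }
};

// Index to force for a dropped hypothesis while predicting factor group 0.
// With a factored vocabulary this must be the EOS lemma index, not the EOS word index.
WordIndex forcedEosIndex(const ToyFactoredVocab* factoredVocab, WordIndex plainEosId) {
  if(factoredVocab) {
    std::vector<std::size_t> eosFactors;
    factoredVocab->word2factors(factoredVocab->getEosId(), eosFactors);
    return static_cast<WordIndex>(eosFactors[0]);  // lemma only
  }
  return plainEosId;  // without factors, lemma index and word index coincide
}
```

In this toy layout the packed EOS word index and the EOS lemma index differ, so writing the former into a lemma-sized output would select the wrong entry. Without a factored vocabulary the function simply returns the plain EOS index, matching the "no change when FS is not used" behavior described above.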
2024-09-11 06:15:56 +03:00 · 2020-02-07 01:22:21 +00:00 · 2020-02-07 01:22:21 +00:00 · 1044f7f587
commit 1044f7f587
parent b3a23108b4
1 changed file with 14 additions and 4 deletions
--- a/src/translator/beam_search.h
+++ b/src/translator/beam_search.h
@@ -74,10 +74,20 @@ public:
      const auto currentBatchIdx = (key / vocabSize) / nBestBeamSize;
      const auto origBatchIdx    = reverseBatchIdxMap.empty() ? currentBatchIdx : reverseBatchIdxMap[currentBatchIdx]; // map currentBatchIdx back into original position within starting maximal batch size, required to find correct beam

-      bool dropHyp = !dropBatchEntries.empty() && dropBatchEntries[origBatchIdx];
-
-      // if we force=drop the hypothesis, assign EOS, otherwise the expected word id. 
-      const auto wordIdx    = dropHyp ? trgVocab_->getEosId().toWordIndex() : (WordIndex)(key % vocabSize);
+      bool dropHyp = !dropBatchEntries.empty() && dropBatchEntries[origBatchIdx] && factorGroup == 0;
+      
+      WordIndex wordIdx;
+      if(dropHyp) { // if we force=drop the hypothesis, assign EOS, otherwise the expected word id.
+        if(factoredVocab) { // when using factoredVocab, extract the EOS lemma index from the word id; we are predicting factors one by one here, hence lemma only
+          std::vector<size_t> eosFactors;
+          factoredVocab->word2factors(factoredVocab->getEosId(), eosFactors);
+          wordIdx = eosFactors[0]; 
+        } else { // without factoredVocab lemma index and word index are the same. Safe cruising. 
+          wordIdx = trgVocab_->getEosId().toWordIndex();
+        }
+      } else { // we are not dropping anything, just assign the normal index
+        wordIdx = (WordIndex)(key % vocabSize);
+      }

      // @TODO: We currently assign a log probability of 0 to all beam entries of the dropped batch entry, instead it might be a good idea to use
      // the per Hyp pathScore without the current expansion (a bit hard to obtain).