Merged PR 11434: Fixes empty line handling with factored segmenter

Fixes empty line handling with the factored segmenter (FS). In my previous PR, which fixed general empty line handling, I misunderstood the relation between WordIndex and factors and did an incorrect inverse look-up of the word index of EOS. This should now be fixed for FS; there should be no change when FS is not used.
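
The distinction the message describes can be sketched with a toy model. This is a minimal illustration, not Marian's FactoredVocab: `ToyFactoredVocab`, its packing scheme (`wordIndex = lemma * numFactorValues + factor`) and `eosIndexForLemmaStep` are all assumptions made for the example. It only shows why the lemma index of EOS, rather than its full word index, is the value to force while lemmas are being predicted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy encoding, NOT Marian's FactoredVocab: a word index packs a
// lemma index and a single factor index as
//   wordIndex = lemma * numFactorValues + factor.
struct ToyFactoredVocab {
  std::size_t numFactorValues;  // size of the single factor group (assumption)
  std::size_t eosWordIndex;     // packed word index of EOS (assumption)

  // Unpack a word index into {lemma, factor}; plays the role that
  // word2factors plays in the changed code.
  std::vector<std::size_t> word2factors(std::size_t wordIndex) const {
    return {wordIndex / numFactorValues, wordIndex % numFactorValues};
  }
};

// Index to force for a dropped batch entry while predicting lemmas
// (factor group 0).
std::size_t eosIndexForLemmaStep(const ToyFactoredVocab* factoredVocab,
                                 std::size_t plainEosWordIndex) {
  if(factoredVocab) {
    // With factors: use the lemma index of EOS, not its packed word index.
    return factoredVocab->word2factors(factoredVocab->eosWordIndex)[0];
  }
  // Without factors: lemma index and word index are the same.
  return plainEosWordIndex;
}
```

In this toy setup, passing the packed EOS word index straight through would select a different lemma whenever `numFactorValues` is greater than 1, which is the kind of mismatch the message describes.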
Martin Junczys-Dowmunt 2020-02-07 01:22:21 +00:00
parent b3a23108b4
commit 1044f7f587

@@ -74,10 +74,20 @@ public:
 const auto currentBatchIdx = (key / vocabSize) / nBestBeamSize;
 const auto origBatchIdx = reverseBatchIdxMap.empty() ? currentBatchIdx : reverseBatchIdxMap[currentBatchIdx]; // map currentBatchIdx back into original position within starting maximal batch size, required to find correct beam
-bool dropHyp = !dropBatchEntries.empty() && dropBatchEntries[origBatchIdx];
-// if we force=drop the hypothesis, assign EOS, otherwise the expected word id.
-const auto wordIdx = dropHyp ? trgVocab_->getEosId().toWordIndex() : (WordIndex)(key % vocabSize);
+bool dropHyp = !dropBatchEntries.empty() && dropBatchEntries[origBatchIdx] && factorGroup == 0;
+WordIndex wordIdx;
+if(dropHyp) { // if we force=drop the hypothesis, assign EOS, otherwise the expected word id.
+if(factoredVocab) { // when using factoredVocab, extract the EOS lemma index from the word id, we predicting factors one by one here, hence lemma only
+std::vector<size_t> eosFactors;
+factoredVocab->word2factors(factoredVocab->getEosId(), eosFactors);
+wordIdx = eosFactors[0];
+} else { // without factoredVocab lemma index and word index are the same. Safe cruising.
+wordIdx = trgVocab_->getEosId().toWordIndex();
+}
+} else { // we are not dropping anything, just assign the normal index
+wordIdx = (WordIndex)(key % vocabSize);
+}
 // @TODO: We currently assign a log probability of 0 to all beam entries of the dropped batch entry, instead it might be a good idea to use
 // the per Hyp pathScore without the current expansion (a bit hard to obtain).
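
The index arithmetic at the top of the hunk can be tried in isolation. The following is a minimal sketch, assuming `key` is a flat index over (batch entry, beam hypothesis, vocabulary entry). `DecodedKey` and `decodeKey` are hypothetical names introduced for illustration; only the two expressions are taken from the code above.

```cpp
#include <cstddef>

// Hypothetical helper type; field names mirror the variables in the hunk.
struct DecodedKey {
  std::size_t currentBatchIdx;  // batch entry the key belongs to
  std::size_t wordIdx;          // index within the vocabulary dimension
};

// Split a flat key using the same expressions as the hunk:
//   currentBatchIdx = (key / vocabSize) / nBestBeamSize
//   wordIdx         = key % vocabSize
DecodedKey decodeKey(std::size_t key,
                     std::size_t vocabSize,
                     std::size_t nBestBeamSize) {
  return {(key / vocabSize) / nBestBeamSize, key % vocabSize};
}
```

For example, with `vocabSize = 10` and `nBestBeamSize = 4`, a key built as `(1 * 4 + 2) * 10 + 7 = 67` decodes to batch entry 1 and word index 7.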