Unifies quality estimation (QE) behind a single interface and refactors the
previously available quality scores to fit it. Adds a new class of models,
powered by Logistic Regression, as a second implementation of that interface.
QE now annotates words rather than subwords, using a rule-based algorithm
that groups subwords into words at space characters.
QualityEstimation
-----------------
Implementations of QE share a common `QualityEstimator` interface.
1. The log-probabilities from the machine-translation model, re-interpreted
as quality scores, are provided as an implementation of QualityEstimator.
2. A Logistic-Regression based model is added. This class of models is
trained with supervision on scores labeled by a human annotator.
Handcrafted features (number of words, log-probabilities from the MT model
and statistics over the sequence) form the numeric input to the model.
LogisticRegressor and Matrix (to hold features) are added.
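The two implementations above can be sketched as follows. This is a minimal
illustration, not the code in this change: the method name
`computeWordScores`, the `WordLogProbs` type, the per-word averaging and the
three features chosen are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical input type: each inner vector holds the subword
// log-probabilities that belong to one word.
using WordLogProbs = std::vector<std::vector<float>>;

// Common interface: one quality score per word.
class QualityEstimator {
 public:
  virtual ~QualityEstimator() = default;
  virtual std::vector<float> computeWordScores(const WordLogProbs &words) const = 0;
};

// Mean of a non-empty list of values.
inline float mean(const std::vector<float> &values) {
  assert(!values.empty());
  return std::accumulate(values.begin(), values.end(), 0.0f) / static_cast<float>(values.size());
}

// Implementation 1: reuse the MT model's log-probabilities as scores.
// Averaging per word is an assumption of this sketch.
class LogProbQualityEstimator : public QualityEstimator {
 public:
  std::vector<float> computeWordScores(const WordLogProbs &words) const override {
    std::vector<float> scores;
    for (const auto &subwordLogProbs : words) scores.push_back(mean(subwordLogProbs));
    return scores;
  }
};

// Implementation 2: logistic regression over handcrafted features.
// The features (mean log-prob, minimum log-prob, number of words) are
// illustrative stand-ins for the ones described above.
class LogisticRegressionQualityEstimator : public QualityEstimator {
 public:
  LogisticRegressionQualityEstimator(std::vector<float> weights, float intercept)
      : weights_(std::move(weights)), intercept_(intercept) {
    assert(weights_.size() == 3);
  }

  std::vector<float> computeWordScores(const WordLogProbs &words) const override {
    std::vector<float> scores;
    const float numWords = static_cast<float>(words.size());
    for (const auto &subwordLogProbs : words) {
      const float features[3] = {
          mean(subwordLogProbs),
          *std::min_element(subwordLogProbs.begin(), subwordLogProbs.end()),
          numWords};
      float z = intercept_;
      for (std::size_t i = 0; i < 3; ++i) z += weights_[i] * features[i];
      scores.push_back(1.0f / (1.0f + std::exp(-z)));  // sigmoid
    }
    return scores;
  }

 private:
  std::vector<float> weights_;
  float intercept_;
};
```

Callers hold a `QualityEstimator` and do not need to know which of the two
implementations produced the scores.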
Which implementation is instantiated depends on the `AlignedMemory` supplied
(whether loaded from the file system or passed as a parameter). An empty
AlignedMemory yields quality scores from NMT, while an AlignedMemory whose
contents are the weights of a trained logistic-regression model in binary
format adds a pass through that model to provide more refined scores.
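The switch described above might look like the sketch below. `Memory` is a
stand-in for `AlignedMemory`, and `createQualityEstimator`, `NmtScores` and
`LogisticRegressionScores` are hypothetical names used only for this
example; parsing of the binary weights is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for AlignedMemory: only "empty or not" matters for the switch.
struct Memory {
  std::vector<char> bytes;
  std::size_t size() const { return bytes.size(); }
};

class QualityEstimator {
 public:
  virtual ~QualityEstimator() = default;
  virtual std::string name() const = 0;
};

// Scores taken directly from the NMT log-probabilities.
class NmtScores : public QualityEstimator {
 public:
  std::string name() const override { return "nmt-log-probs"; }
};

// Scores refined by a logistic-regression model built from the memory.
class LogisticRegressionScores : public QualityEstimator {
 public:
  explicit LogisticRegressionScores(Memory weights) : weights_(std::move(weights)) {}
  std::string name() const override { return "logistic-regression"; }

 private:
  Memory weights_;  // binary weights; decoding them is out of scope here
};

// Empty memory selects NMT scores; non-empty memory selects the
// logistic-regression model.
std::unique_ptr<QualityEstimator> createQualityEstimator(const Memory &qualityMemory) {
  if (qualityMemory.size() == 0) {
    return std::unique_ptr<QualityEstimator>(new NmtScores());
  }
  return std::unique_ptr<QualityEstimator>(new LogisticRegressionScores(qualityMemory));
}
```

Keeping the decision in one factory function means the rest of the pipeline
only ever sees the `QualityEstimator` interface.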
Both implementations now transform subwords into "words" using a heuristic
algorithm that scans for spaces. This lets the client report quality on
"words" instead of subwords, which is more meaningful to the user.
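One way such a space-scanning heuristic could work is sketched below. It
assumes each subword is given as a byte range into the target text and that
a subword starting with a space opens a new word; `ByteRange` and
`subwordsToWords` are names chosen for this example, and the rules in the
actual change may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Half-open byte range [begin, end) into the target text.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Groups subword byte ranges into word byte ranges. A subword whose first
// byte is a space starts a new word (the space itself is excluded); any
// other subword extends the current word.
std::vector<ByteRange> subwordsToWords(const std::string &text,
                                       const std::vector<ByteRange> &subwords) {
  std::vector<ByteRange> words;
  for (const ByteRange &subword : subwords) {
    assert(subword.begin <= subword.end && subword.end <= text.size());
    if (subword.begin == subword.end) continue;  // skip empty pieces, e.g. an end marker
    const bool startsWithSpace = text[subword.begin] == ' ';
    if (words.empty() || startsWithSpace) {
      const std::size_t begin = startsWithSpace ? subword.begin + 1 : subword.begin;
      words.push_back({begin, subword.end});
    } else {
      words.back().end = subword.end;
    }
  }
  return words;
}
```

For the text "Hello world." split into the subwords "Hel", "lo", " world"
and ".", this groups the pieces into the two words "Hello" and "world.".
Punctuation stays attached to the preceding word because only spaces are
treated as boundaries.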
Testing
-------
1. BRT now has two new test apps to check the QE outputs in the text domain
(covers subword to word mapping) and the numbers domain (covers quality
scores). These are tested with en-et models, for which QualityEstimation is
now available, on a new input to avoid architecture/compiler issues.
2. A unit test for the LogisticRegression model is added.
Docs
----
Doxygen now supports MathJax properly, so the explanations of the
reductions that make Logistic Regression computation more efficient
render correctly.
Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com>
Co-authored-by: Jerin Philip <jerinphilip@live.in>
Updates marian-dev and ssplit submodules to point to the upstream
commits that implement the following:
- marian-dev: encodeWithByteRanges(...) to get source token byte-ranges
- ssplit: Has a trivial sentence-splitter functionality implemented, and
is now faster to benchmark against marian-decoder.
This enables a marian-decoder replacement written through ssplit in this
source to be benchmarked continuously against the existing marian-decoder.
Nits: Removes the logging introduced for multiple workers, along with the
respective log statements.