bergamot-translator

mirror of https://github.com/browsermt/bergamot-translator.git synced 2024-09-11 05:35:33 +03:00

Author	SHA1	Message	Date
Jelmer	e061b5613e	Treat most HTML elements as word-breaking (#286)	2022-01-16 10:26:40 +00:00
Andre Barbosa	63120c174e	QualityEstimation: Preliminary Implementation (#197) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com> Co-authored-by: Jerin Philip <jerinphilip@live.in>	2021-09-16 16:28:40 +01:00
abhi-agg	c64deb50a8	Imported CI scripts from mozilla/bergamot-translator-old (#1) * CircleCI config, docs and badge * Increase CircleCI RAM from 4gb to 16gb Co-authored-by: Motin <motin@motin.eu>	2021-03-10 09:30:39 -08:00
Jerin Philip	10dcb8f548	Merge remote-tracking branch 'origin/wasm-integration' into jp/absorb-batch-translator Merging wasm-integration. Single thread codepath seems functional. Multithreading is broken.	2021-02-17 13:08:58 +00:00
Motin	49ad6514ae	Add reproducible docker-based builds + let test page use these by default	2021-02-15 11:27:47 +02:00
Motin	7030fa0157	Ignore test page bundled artifacts	2021-02-15 11:25:13 +02:00
Motin	e50dd0909f	Ignore contents in models directory	2021-02-15 11:23:08 +02:00
Andre Natal	1e413f71cd	Including a more elaborated test page, a node webserver containing the proper cors headers and wasm mimetype	2021-02-13 18:23:25 -08:00
Jerin Philip	38e8b3cd6d	Updates: marian-dev, ssplit for marian-decoder-new Updates marian-dev and ssplit submodules to point to the upstream commits which implements the following: - marian-dev: encodeWithByteRanges(...) to get source token byte-ranges - ssplit: Has a trivial sentencesplitter functionality implemented, and now is faster to benchmark with marian-decoder. This enables a marian-decoder replacement written through ssplit in this source to be benchmarked constantly with existing marian-decoder. Nits: Removes logging introduced for multiple workers, and respective log statements.	2021-02-12 14:23:24 +00:00
Jerin Philip	e75bd7eb57	Adding vim temporary files to .gitignore	2021-01-22 11:31:20 +00:00

10 Commits