Commit Graph

439 Commits

Author SHA1 Message Date
Nikolay Bogoychev
8563f0856f
Proper arch setting on win32 (#275)
* Proper arch detection on win32

* Whoops
2021-12-14 23:53:53 +00:00
Abhishek Aggarwal
feb9c90429
Additional logs in JS translation worker (#277)
- Print source text received in the response
 - Print no. of block elements in the input
2021-12-14 21:52:00 +01:00
Jerin Philip
571d312930
Constrain mistune to fix docs CI (#278) 2021-12-14 16:34:30 +00:00
Abhishek Aggarwal
e75a9e1da3
More robust logic to import wasm gemm (#276)
- Import optimized gemm implementation only if all the necessary functions
   are provided by it, otherwise use the fallback gemm
2021-12-14 16:39:19 +01:00
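The all-or-nothing check described in #276 can be sketched as follows. The commit itself touches the emscripten-generated JS; this is only a C++ illustration of the logic, and `useOptimizedGemm`, `exported` and `required` are hypothetical names rather than anything from the repository.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns true only if every required function name is exported by the
// optimized module; a partially populated module is rejected as a whole.
bool useOptimizedGemm(const std::set<std::string>& exported,
                      const std::vector<std::string>& required) {
  for (const std::string& name : required) {
    if (exported.count(name) == 0) {
      return false;  // one missing function is enough to use the fallback
    }
  }
  return true;
}
```

Checking the whole set up front avoids mixing functions from the optimized and the fallback implementation at run time.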
Abhishek Aggarwal
8e79897f30
Updated configuration for html text translation to work in wasm test page (#269)
* Updated translator configuration in wasm test page
 - Added alignment: soft

* Set ResponseOptions::alignment to "true"
 - Had to be set for html text translation to work
2021-12-01 11:32:51 +01:00
Abhishek Aggarwal
e8fd01e9f4 Updated marian-dev submodule 2021-11-30 17:19:42 +01:00
Jelmer
eea5554b91
HTML handling improvements (#266)
* Fix out-of-bounds error when determining alignment for whole word

If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds.

* Don't segfault if alignment info is not available

When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token.

* Partial fix for handling empty elements

This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side.

* Fix formatting
2021-11-29 08:41:24 +00:00
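The first fix in #266 concerns a backwards scan with an unsigned offset. A minimal sketch of a guarded loop, assuming a hypothetical per-token `isContinuation` flag (the project's actual data structures are not shown in this log):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the token that starts the word
// containing token `i`. isContinuation[k] is true when token k continues the
// previous token's word (it has no leading space).
std::size_t wordStart(const std::vector<bool>& isContinuation, std::size_t i) {
  assert(i < isContinuation.size());
  // Stop at 0 before decrementing: with an unsigned index, `0 - 1` wraps
  // around to the largest representable value instead of becoming -1.
  while (i > 0 && isContinuation[i]) {
    --i;
  }
  return i;
}
```

With the `i > 0` guard, a continuation token at offset 0 is simply treated as the start of its word.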
Kenneth Heafield
40366162d8
HTML input (#253)
Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
Co-authored-by: Abhishek Aggarwal <aaggarwal@mozilla.com>
2021-11-25 13:57:50 +00:00
Abhishek Aggarwal
2b1b0531ff
Import optimized gemm implementation (when available) for wasm target (#265)
* Enable importing optimized gemm module for wasm

 - Updated emscripten generated JS code to
   -- import and use the optimized gemm module when available, otherwise
     use fallback gemm implementation

* Added logging for gemm implementation being used for wasm target
2021-11-17 09:18:55 +01:00
Abhishek Aggarwal
f9e55b3cd8
Make script run from any directory (#262)
* Make script run from any directory
2021-11-15 22:30:52 +01:00
Andre Natal
d6a14b1d6f
Fix badge to point to this repo instead of mozilla's (#261) 2021-11-15 08:14:21 +00:00
Jerin Philip
5a693b7eec
Fixes windows workflow for PCRE2 (#260) 2021-11-05 20:48:28 +00:00
Jerin Philip
fa4efb483b
Update ssplit cpp, pcre2 source compile to fix broken builds (#258)
* Update ssplit cpp, pcre2 source compile to fix tests

* Syncing with browsermt/ssplit-cpp

* Removing accidental binary inclusion

* Removing brt accidental update by git add -u

* Fix windows workflow, vcpkg is broken use our cmake route

* [ssplit-cpp] Try searching different library names for Windows
2021-11-05 16:46:03 +00:00
Abhishek Aggarwal
7693a1d007
Updated marian submodule (#256) 2021-11-03 13:54:48 +01:00
Jerin Philip
0bb8095bca
Deprecate hardAlignment in favour of softAlignment (#250) 2021-11-01 19:21:28 +00:00
Jerin Philip
806169c822
Recover logging (#226) 2021-11-01 16:31:01 +00:00
Abhishek Aggarwal
c5bc3f5191
Update config "skip-cost" to enable log probabilities for QE scores (#247)
- Updated wasm test page
2021-11-01 13:06:23 +01:00
Jerin Philip
9b443997e2
EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243) 2021-10-31 12:33:42 +00:00
Jerin Philip
47e57c95a6
[ssplit-cpp] Enable position independent library when compiled from sources (#240) 2021-10-29 13:40:28 +01:00
Jerin Philip
45412ce7de
Set PR to any branch to trigger workflows (#230) 2021-10-28 09:30:02 +01:00
Jerin Philip
2b98c67996
Cache for translations (#227)
Sets up a cache that operates on each sentence a TranslationModel processes,
caching the corresponding marian::History under a {TranslationModel::Id, marian::Words}
key. The cache is thus shared across multiple TranslationModels and bound to the
lifetime of a Service. The cache gracefully downgrades in the case of WebAssembly.
2021-10-27 20:37:05 +01:00
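The keying scheme from #227 can be illustrated with a small sketch. The aliases below are stand-ins for `TranslationModel::Id`, `marian::Words` and `marian::History`, and `TranslationCache` is a hypothetical class; eviction, thread safety and the WebAssembly downgrade are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for TranslationModel::Id, marian::Words and marian::History.
using ModelId = std::size_t;
using Words = std::vector<std::uint32_t>;
using History = std::string;

class TranslationCache {
 public:
  // Returns a pointer to the cached entry, or nullptr on a miss.
  const History* find(ModelId model, const Words& words) const {
    auto it = entries_.find(Key{model, words});
    return it == entries_.end() ? nullptr : &it->second;
  }

  void store(ModelId model, const Words& words, History history) {
    entries_[Key{model, words}] = std::move(history);
  }

 private:
  // Including the model id in the key keeps identical sentences translated
  // by different models from colliding in the shared cache.
  using Key = std::pair<ModelId, Words>;
  std::map<Key, History> entries_;
};
```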
Abhishek Aggarwal
d0d08c0f54
JS bindings for Quality Estimation (#239)
* Quality Score bindings complete
* Updated wasm test page to test the bindings
  - Word and sentence scores can be seen in browser console
2021-10-27 19:26:55 +02:00
Abhishek Aggarwal
c5167b3d8c
Import matrix-multiply from a separate wasm module (#232)
* Updated marian-dev submodule
* Import wasm gemm from a separate wasm module
 - The fallback implementation of gemm is currently being imported dynamically
   for wasm target
* Updated CI scripts and README to import GEMM from a separate wasm module
* Setting model config to int8shiftAlphaAll in wasm test page
2021-10-27 11:54:39 +02:00
Abhishek Aggarwal
a0cb1e4b3d
Wasm test page UI for translating b/w non-English language pairs (#231)
* Updated Wasm test page UI for translating b/w non-English language pairs
* Both "from" and "to" language dropdowns now allow non-English languages
2021-10-19 14:40:54 +02:00
Abhishek Aggarwal
c7b626dfd0
Adapted wasm test page for new Service interface (#224)
- The new interface now supports running multiple TranslationModels
2021-09-28 15:53:02 +05:30
Jerin Philip
cf541c68f9
Multiple TranslationModels Implementation (#210)
For outbound translation, we need multiple models in the
inventory at the same time, with the "how-to-translate"
using a model abstracted out.

Reorganization: TranslationModel + Service. The new entity which
contains everything required to translate in one direction is
`TranslationModel`. The how-to-translate blocking single-threaded mode
of operation or async multi-threaded mode of operation is decoupled as
`BlockingService` and `AsyncService`. There is a new regression-test
using multiple models in conjunction added, also serving as
a demonstration for using multiple models in Outbound Translation.

WASM: WebAssembly, being unable to use threads, uses
`BlockingService`. Bindings are provided with a new API to work with a
Service and multiple TranslationModels, which the client (JS extension)
can inventory and maintain. Ownership of a given `TranslationModel` is
shared while translations using the model are active in the internal
mechanism.

Config-Parsing: So far bergamot-translator has been hijacking marian's
config-parsing mechanisms. However, in order to support multiple models,
it has become impractical to continue this approach and a new
config-parsing that is bergamot specific is provisioned for
command-line applications constituting tests. The original marian
config-parsing tooling is only associated with a subset of
`TranslationModel` now. The new config-parsing for the library manages
workers and other common options (tentatively).

Known issue: inefficient placement of workspaces leads to more memory
usage than necessary. The fix is to trickle down from marian-dev in a
later pull request.

This PR also brings in BRT changes which fix speed-tests that were
broken and also fixes some QE outputs which were different due to not
using shortlist.
2021-09-21 18:10:40 +01:00
Andre Barbosa
63120c174e
QualityEstimation: Preliminary Implementation (#197)
Unifies quality estimation with an interface, refactors previously available
quality scores to fit this interface. Adds a new class of model with Logistic
Regression powering the predictions as an implementation of said interface. 
QE now provides annotations on words using subwords to word rule-based 
algorithms working with space characters. 

QualityEstimation
-----------------

Implementations of QE are bound together by a `QualityEstimator`
Interface. 

1. The log-probabilities from the machine-translation model re-interpreted
   as quality scores are crafted as an implementation of QualityEstimator.

2. A Logistic-Regression based model is added. This class of models is
   trained supervised with scores labeled by a human annotator.
   Handcrafted features - number of words, log probs from MT model and 
   statistics over the sequence are used to generate the numeric features.
   LogisticRegressor, Matrix (to hold features) are added.
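As a rough illustration of item 2, the scoring step of a logistic-regression model is a sigmoid over a weighted sum of features. The function below is a generic sketch with made-up names; the project's actual feature set, scaling and `LogisticRegressor` interface are not shown in this log.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Generic logistic regression score: sigmoid(bias + sum_i w_i * x_i).
// The result lies in (0, 1) and can be read as a quality score.
double logisticScore(const std::vector<double>& weights, double bias,
                     const std::vector<double>& features) {
  assert(weights.size() == features.size());
  double z = bias;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    z += weights[i] * features[i];
  }
  return 1.0 / (1.0 + std::exp(-z));
}
```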

The creation of an instance is switched by the `AlignedMemory` supplied
(be it loaded from the file-system or supplied as a parameter). An empty
AlignedMemory leads to quality scores from NMT while supplying weights
of a trained logistic-regression model in binary format as the contents
lead to an additional pass through the said model to provide more
refined scores.

Both the above now transform subwords into "words" using a heuristic
algorithm, scanning for spaces. This allows the client to work with "words"
to denote quality instead of subwords, as the former is more sensible to
the user.
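A sketch of such a space-scanning heuristic, under the assumption that a subword beginning with a space starts a new word and any other subword continues the current one. `groupSubwords` is a hypothetical name and the real rule may differ in details.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Groups subword tokens into words. Assumption: a token that begins with a
// space starts a new word; any other token continues the current word.
// Returns half-open [begin, end) token index ranges, one per word.
std::vector<std::pair<std::size_t, std::size_t>> groupSubwords(
    const std::vector<std::string>& subwords) {
  std::vector<std::pair<std::size_t, std::size_t>> words;
  for (std::size_t i = 0; i < subwords.size(); ++i) {
    bool startsWord = words.empty() ||
                      (!subwords[i].empty() && subwords[i][0] == ' ');
    if (startsWord) {
      words.emplace_back(i, i + 1);
    } else {
      words.back().second = i + 1;  // extend the current word
    }
  }
  return words;
}
```

Per-word quality scores could then be computed over each returned range of subword scores.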

Testing
-------

1. BRT now has two new test apps to check the QE outputs in text
  (covers subword to words) and numbers domain (covers quality scores).
  These are tested with en-et models for which QualityEstimation is
  available now, on a new input to avoid architecture/compiler issues.
2. Unit test for LogisticRegression model is added.


Docs
----

Doxygen now supports MathJax properly, so the explanations of the
reductions applied to Logistic Regression to make computation more
efficient render correctly.

Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com>
Co-authored-by: Jerin Philip <jerinphilip@live.in>
2021-09-16 16:28:40 +01:00
Jerin Philip
48e955c468
BRT: Update sacrebleu to get tests back working (#217)
Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
2021-09-07 19:10:41 +01:00
Abhishek Aggarwal
8e4374282a Circle CI wasm artifacts for non-wormhole builds 2021-08-31 17:01:52 +02:00
Abhishek Aggarwal
cafb65e0b5 Wasm builds without SharedArrayBuffer 2021-08-27 09:07:06 +02:00
Abhishek Aggarwal
ff391c6f00 Updated marian submodule to latest commit of master 2021-08-27 09:07:06 +02:00
Abhishek Aggarwal
b64ffce496
Wasm test page using web workers now (#218) 2021-08-26 15:22:52 +02:00
Jerin Philip
972d8560b5
Add a clang-tidy run (#214)
Adds a clang-tidy run in addition to the existing clang-format checks.
The clang-tidy checks are not enforced, but are potentially useful to
point to during review.
2021-08-13 16:26:44 +01:00
Abhishek Aggarwal
9994d4acdb
Merge pull request #215 from abhi-agg/non-wormhole-builds
Wasm build instructions to run translations on other browsers
2021-08-11 19:19:13 +02:00
Abhishek Aggarwal
f3e00ae657 Added build instructions to run on other browsers
- Disabled compiling with wormhole which is Firefox specific feature
2021-08-11 13:28:15 +02:00
Jerin Philip
d31f96381b
Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208)
A cmake change has caused vcpkg to fail without much error message,
which is causing windows workflow runs to fail. Details in the following
link:

* https://github.com/microsoft/vcpkg/issues/18718

To fix, we're going with a version bump in vcpkg. Seeing that run-vcpkg
also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4
Playing with fire: vcpkg master commit
2021-07-29 12:25:09 +01:00
Abhishek Aggarwal
5a8fe209ce Wasm: Enabled sentence byte ranges in the wasm test page
- Use JS bindings to print all sentences individually on
   console
2021-07-19 12:06:22 +02:00
Abhishek Aggarwal
7052722cd2 JS bindings to return sentence byte ranges 2021-07-19 12:06:22 +02:00
Abhishek Aggarwal
6ad794fcef Added public methods in Response class to return sentences
- Refactored ByteRange struct and moved it to definition.h
2021-07-19 12:06:22 +02:00
Jerin Philip
a202e350c7
Change ResponseBuilder to accept callback instead of future (#142)
* Change ResponseBuilder to accept callback

Breaks things everywhere, now we follow the compiler to fix and convert
the std::future -> callback.

* More std::future -> callback

* std::future out of service.{h,cpp}

* compile is working, so is callback

* Some reshuffling of args

* Fixing merge error

* Fixing signature conflicts out of merge

* Fixing that test duct-taping future

* Minor adjustment to get that future back

* Add documentation for the new callback function

* Applying clang-format after update

* Using default responseOptions

* Remove future references from documentation

* translateMultiple only for WASM (#177)

* BRT: update to main; fresh-failures hopefully

* Converting test translateFromStdin to use callback

* BRT: Add fresh #native and #wasm tags

* future from promise, fix error

* Adding #native to GitHub CI

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
2021-07-05 14:51:01 +01:00
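The std::future -> callback conversion in #142, together with the "future from promise" step, can be sketched as below. `Response`, `Callback`, `translate` and `translateAsFuture` are simplified, hypothetical stand-ins and not the library's real signatures.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real Response type and translate entry point.
struct Response {
  std::string target;
};
using Callback = std::function<void(Response&&)>;

// Callback style: the result is handed to the caller-supplied function
// instead of being returned through a std::future.
void translate(const std::string& source, Callback callback) {
  Response response{"[translated] " + source};
  callback(std::move(response));
}

// A caller that still wants a std::future can build one from a std::promise
// that is fulfilled inside the callback.
std::future<Response> translateAsFuture(const std::string& source) {
  auto promise = std::make_shared<std::promise<Response>>();
  std::future<Response> future = promise->get_future();
  translate(source, [promise](Response&& response) {
    promise->set_value(std::move(response));
  });
  return future;
}
```

Keeping the callback as the primitive lets single-threaded targets avoid futures entirely, while threaded callers can opt back in with the wrapper.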
Jerin Philip
cb855be1a7
maxLengthBreak_ -> wrapStep bugfix (#200) 2021-06-28 14:54:39 +01:00
Jerin Philip
13a1fe870f
Load sentence-splitter (non-breaking prefixes) from ByteArray
Service now allows loading Sentence-Splitter (non-breaking prefix file) from ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist): the ByteArray is used if it is non-empty, otherwise loading falls back to the file-path.

Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. 

Bonus refactoring to remove an extra layer that existed for no reason.
2021-06-21 18:53:30 +01:00
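A minimal sketch of this memory-first, file-fallback loading order, assuming the ByteArray is preferred whenever it is non-empty. `ByteArray` here is a plain stand-in type and `loadPrefixes` a hypothetical function, not the Service's actual API.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

using ByteArray = std::vector<char>;  // stand-in for the in-memory buffer type

// Prefer the in-memory bytes; read from `path` only when none were supplied.
std::string loadPrefixes(const ByteArray& memory, const std::string& path) {
  if (!memory.empty()) {
    return std::string(memory.begin(), memory.end());
  }
  std::ifstream file(path, std::ios::binary);
  // An unreadable file yields an empty string in this sketch.
  return std::string(std::istreambuf_iterator<char>(file),
                     std::istreambuf_iterator<char>());
}
```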
Jerin Philip
44aa70a064
Account for EOS in both source and target annotations (#190) 2021-06-15 18:59:51 +01:00
Abhishek Aggarwal
b00116cb94
Refactor wasm bindings to use consistent interface names as in native (#195)
* Refactored wasm bindings code
 - Replaced TranslationModel, TranslationRequest and TranslationResult
    with Service, ResponseOptions and Response
 - Corresponding documentation changes
 - Names of the bindings files changed
 - Moved Vector<Response> definition in Response specific bindings
   file
2021-06-15 16:02:14 +02:00
Jerin Philip
4b014665ba
Removing alignments and quality-scores test-code (#196)
* Removing alignments and quality-scores test-code
* BRT: Update to main
2021-06-14 18:40:41 +01:00
Jerin Philip
e9e5ac6782
Partial test-apps and tolerance in evaluations (#184)
* Partial test applications

Previously service-cli was used to generate output and accomplish
regression testing for all of: (1) translated-text (2) alignment tokens
+ scores (3) quality scores (4) indirectly annotation and tokenizations.

The --mode native now only outputs the translated text, faithful to the
source, for the input read from stdin.

Test apps are separated into testing only individual functionalities.
This can help in independently testing ssplit-cpp, quality-scores for
the quality estimation implementation etc.

Separating numbers and text has the advantage of being able to compare
one with tolerance using BLEU (text) and some allowed error-rates
(numbers).
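For the numeric side, a tolerance comparison might look like the sketch below. `approxEqual` and the absolute-error criterion are assumptions made for illustration; the log does not say which error measure the tests use.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when both sequences have the same length and every pair of values
// differs by at most `tolerance` (absolute error).
bool approxEqual(const std::vector<double>& expected,
                 const std::vector<double>& actual, double tolerance) {
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (std::fabs(expected[i] - actual[i]) > tolerance) return false;
  }
  return true;
}
```

An exact comparison would flag tiny floating-point differences between architectures or compilers as regressions, which is what the tolerance is meant to avoid.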

* Removing #mac tag

* Moving test apps to src/tests

* Tests are always on for CI

Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES.

* Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE

* Workaround for now; CMakeLists.txt horrors are starting to bite

* BRT: use bergamot-test instead of bergamot now

* This should fix issues: CMakeLists.txt has so many paths

* Casing to camelCase and removing legacyServiceCli

* removing leftover service-cli declaration, some doc updates

* #pragma once is starting to look easier

* All the more reasons to do #pragma once

* Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID

* BRT: Use --gemm-highest-arch instead of python script

* Adding intgemm resolve here, where always(?) have intgemm on?

* intgemm-resolve in default binary directory

* BRT: Update to use intgemm-resolve

* marian-dev: Reset to without --gemm-highest-precision

Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>
2021-06-14 15:02:42 +01:00
Abhishek Aggarwal
16eb47f47e
Generating cmake configured project version (.js) file in build folder (#194)
- Earlier this file was being generated in folder containing
   actual sources

 - Fixes https://github.com/browsermt/bergamot-translator/issues/161
2021-06-09 13:57:23 +01:00
Jerin Philip
3039dea34b
Fixing if syntax with YAML var substitution (#188) 2021-06-09 10:21:23 +01:00
Jerin Philip
dc2fb3d64e
CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193)
* Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string

* marian-dev cmake fix

* Generate project.h in binary dir

* We don't want people asking about extra spaces
2021-06-09 10:12:00 +01:00
Abhishek Aggarwal
3e46e3391c Consistent EMSDK version and parallel make jobs in README and github actions
- Set EMSDK version to 2.0.9 to make it consistent
   everywhere in repo
 - Set parallel make jobs to 2
2021-06-09 11:10:10 +02:00