mirror of https://github.com/browsermt/bergamot-translator.git synced 2024-08-15 08:30:46 +03:00

Cross-platform C++ library focusing on optimized machine translation on consumer-grade devices.

cpp cross-platform emscripten machine-translation neural-machine-translation neural-networks python starred-browsermt-repo starred-repo wasm webassembly


Jerin Philip bfb5e78602 Alignments + weak quality scores capability in Service (#46 ) * Draft adjustments to API * Adjustments to docs * Let's call the word + sentence ranges annotations * Editing confusing comment on size() * Fixing compilation for template adjustments for SentenceRanges * string_view template hacks This commit shifts AnnotatedBlob into a templated type and gets the troubled part to compile. All to manage absl::string_view and std::string_view. Objective: marian::bergamot stays C++ 11 to pluck and put in marian code, bergamot-translator somehow flexes C++17. Simplify development in one place. * Fixing the wiring: Gets source to build Runtime errors exist, but AnnotatedBlobs are consistent. * Bugfix: Matching old-state after factoring AnnotatedBlob in * Removing vocabs_ from Response. (For the umpteenth time). * Alignment API ready in marian::bergamot::Response * Wiring alignments upto TranslationResult * Adjustment to get alignments; bergamot-translator-app has alignments available * Accessing words instead of Ids This code sets up access of word string_views from annotations instead of printing Ids. However, we have segfault. This is likely due to targetRanges not being set, pending from https://github.com/browsermt/bergamot-translator/issues/25. Could also be a rogue EOS token which we're filtering for in string_view annotations, but not so in alignments. * Switching to browsermt/marian-dev@jp/decode-string-view for targetTokenRanges * Target word byte range annotations available Issues corresponding to #25 should be resolved. There is still a segfault. Could be due to EOS. Pending investigation. * Bugfix: Tokens for alignments are now through. Was not EOS. * browsermt/marian-dev@master ByteRange changes work downstream and has been merged to master. Updating submodule to point to master. 
* Style and documentation enhancements: response.cpp * Style and documentation enhancements: TranslationResult.h * Descriptions for SentenceRanges templating * Switching to marian-dev@wasm-sync * AnnotatedBlob can be copy-ctord/copy-assigned * TranslationResult: Empty ctor + WASM Bindings Allows empty construction of TranslationResult. Using this empty constructor, WASM bindings are adjusted. Unsure of the results, maybe @abhi-agg can test. * Cosmetic: SentenceRangesT -> Annotation - SentenceRangesT is renamed to AnnotationT; - Further comments to explain heavily templated files. * Response: Cleaning up unused members and adding docs * Adding quality scores - attempt * Stub QualityScores This adjustment adds capability to get "scores", which should potentially indicate how confident (at least relative in a target-sentence) should be. This enables writing the code forward for TranslationResult, and an example quality-score people can be pointed at. - These are not between [0,1] yet. - In addition, guards to check out-of-bounds access have been placed so illegal accesses are caught early on during development. * Removing token debug statements * Reworking Annotation without templates https://github.com/mozilla/bergamot-translator/issues/8 provides ByteRanges. - This ByteRange data-type is used in Annotation and converted to marian::string_view(=absl::string-view) on demand. - Since Annotation[using ByteRange] is not bound to anything else, it can be unit tested. A unit test is added (originally to test independently for integration after). - Annotation with ByteRange is now propogated across marian::bergamot and functionality matched to how it was previously working. This eliminates the string-view conversion and template code. * Nit: Removing std::endl flushes * Bring TranslationResult and Response closer Helps https://github.com/browsermt/bergamot-translator/issues/53. 
In preparation , the data-export types for Quality and Alignment are pushed down to Response from TranslationResult and computed during construction. This brings TranslationResult closer to Response, paving way to avoid having two TranslationResults. histories_ only remain for marian-decoder replacement usage, which can be removed in a separate PR. * Clean up hacks originally added for a unit-test to compile * Moving Annotation functions to cpp and documenting header file * Shifting alignments, qualityScore testing capability into main-mts * Restore Unified API files to previous state * Adaptations to fix Response with Quality, Alignments to connect to old Unified API * Missing reset on TranslationResultBindings * Cleaning up Response documentation to reflect newer code * Minor adjustments to get build back after main sync * Marian seems to make available Catch somehow * Disable COMPILE_BERGAMOT_TESTS for WASM * Add COMPILE_BERGAMOT_TESTS as a CMakeDependent option * Use the COMPILE_TESTS flag instead to skip macos.yml * Trigger unit-tests on GitHub runners for Annotation * Reordering enable_testing() to before inclusion of test directory * doc constructs required to operate with alignments Documents with doxygen compatible documentation for Response, AnnotatedBlob, Annotation, ByteRange. Incorporates doxygen compatible documentation for * Updates ByteRange consistent with general C++ Also little documentation enhancements in the process. * Updating marian-dev@9337105 * Copy-paste documentation because lazy * Turn off autoformat and manually edit to fix style changes * AnnotatedBlob -> AnnotatedText; blob -> text * text.text in test app renamed * text of text -> blob of text in places of documentation		2021-03-31 17:41:36 +01:00
.github/workflows	Alignments + weak quality scores capability in Service (#46 )	2021-03-31 17:41:36 +01:00
3rd_party	Update marian-dev submodule to master	2021-03-26 10:02:13 +01:00
app	Alignments + weak quality scores capability in Service (#46 )	2021-03-31 17:41:36 +01:00
doc	Marian compatible documentation tooling (#67 )	2021-03-24 17:00:53 +00:00
src	Alignments + weak quality scores capability in Service (#46 )	2021-03-31 17:41:36 +01:00
wasm	Patch WASM artifacts to run optimized (wormhole enabled) inference (#68 )	2021-03-24 17:10:42 +01:00
.gitignore	Merge remote-tracking branch 'origin/wasm-integration' into jp/absorb-batch-translator	2021-02-17 13:08:58 +00:00
.gitmodules	Update submodule ssplit-cpp	2021-03-03 11:48:56 +01:00
BERGAMOT_VERSION	Marian compatible documentation tooling (#67 )	2021-03-24 17:00:53 +00:00
CMakeLists.txt	Alignments + weak quality scores capability in Service (#46 )	2021-03-31 17:41:36 +01:00
Doxyfile.in	Marian compatible documentation tooling (#67 )	2021-03-24 17:00:53 +00:00
LICENSE	Initial commit	2020-10-19 13:49:38 +02:00
README.md	Patch WASM artifacts to run optimized (wormhole enabled) inference (#68 )	2021-03-24 17:10:42 +01:00


Bergamot Translator

Bergamot Translator provides a unified API for neural machine translation built on the Marian NMT framework. It is part of the Bergamot project, which focuses on improving client-side machine translation in the web browser.

Build Instructions

Build Natively

Clone the repository using these instructions:

git clone https://github.com/browsermt/bergamot-translator
cd bergamot-translator

Compile

Create a folder where you want to build all the artifacts (build-native in this case) and compile in that folder:
```
mkdir build-native
cd build-native
cmake ../
make -j
```
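
Note that `make -j` without a number places no limit on the number of parallel jobs, which can exhaust memory on smaller machines. A minimal sketch of capping the job count at the number of CPU cores is shown below. The helper name `detect_jobs` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of bergamot-translator):
# print a reasonable parallel job count for make.
detect_jobs() {
  if command -v nproc >/dev/null 2>&1; then
    # GNU coreutils, typically available on Linux
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    # Fallback for systems without nproc (e.g. macOS)
    getconf _NPROCESSORS_ONLN
  else
    # Conservative default when the core count cannot be determined
    echo 2
  fi
}

# Example (run inside the build folder):
#   make -j"$(detect_jobs)"
```

The same substitution should work for the `emmake make -j` commands in the WASM build.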

Build WASM

Compiling for the first time

Download and install Emscripten using the following instructions:
- Get the latest sdk: git clone https://github.com/emscripten-core/emsdk.git
- Enter the cloned directory: cd emsdk
- Install the latest sdk tools: ./emsdk install latest
- Activate the latest sdk tools: ./emsdk activate latest
- Activate path variables: source ./emsdk_env.sh
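
Before moving on, it can help to confirm that the Emscripten tools used later (`emcc`, `emcmake`, `emmake`) are actually on the PATH of the current shell. The following is a minimal sketch. The helper name `require_tools` is hypothetical and not part of emsdk or this repository.

```shell
# Hypothetical helper: report whether each named tool is on PATH.
# Returns 0 only if every tool was found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   require_tools emcc emcmake emmake
```

If a tool is reported missing, running `source ./emsdk_env.sh` again in the current shell is the likely fix, since the path variables only apply to the shell in which they were activated.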

Clone the repository using these instructions:

git clone https://github.com/browsermt/bergamot-translator
cd bergamot-translator

Download files (only required if you want to package files in wasm binary)

This step is only required if you want to package files (e.g. models, vocabularies) into the wasm binary. Otherwise, skip this step.

The build preloads the files in Emscripten’s virtual file system.

If you want to package the Bergamot project's models, follow these instructions:
```
mkdir models
git clone --depth 1 --branch main --single-branch https://github.com/mozilla-applied-ml/bergamot-models
cp -rf bergamot-models/prod/* models
gunzip models/*/*
```
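
The `gunzip models/*/*` command above passes every file to `gunzip`, so it reports errors if it is re-run after the files are already decompressed. A minimal sketch of a re-runnable variant that only touches files ending in `.gz` is shown below. The helper name `gunzip_models` is hypothetical.

```shell
# Hypothetical helper: decompress only the .gz files under a directory.
# Safe to run repeatedly: when no .gz files remain, gunzip is not invoked.
gunzip_models() {
  find "$1" -type f -name '*.gz' -exec gunzip {} +
}

# Example (from the repository root):
#   gunzip_models models
```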
Compile
1. Create a folder where you want to build all the artifacts (build-wasm in this case)
```
mkdir build-wasm
cd build-wasm
```
2. Compile the artifacts
  - If you want to package files into the wasm binary, execute the following commands (replace FILES_TO_PACKAGE with the path of the directory containing the files to be packaged):
```
emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR=FILES_TO_PACKAGE ../
emmake make -j
```
    For example, to package the Bergamot project's models (downloaded in the "Download files" step above), replace FILES_TO_PACKAGE with ../models
  - If you don't want to package any files into the wasm binary, execute the following commands:
```
emcmake cmake -DCOMPILE_WASM=on ../
emmake make -j
```
  The wasm artifacts (.js and .wasm files) will be available in the wasm folder of the build directory ("build-wasm" in this case).
3. Enable SIMD Wormhole via Wasm instantiation API in generated artifacts
```
bash ../wasm/patch-artifacts-enable-wormhole.sh
```
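
To confirm that the build produced both kinds of artifact, a quick check of the wasm folder can be sketched as follows. The helper name `check_wasm_artifacts` is hypothetical. It only assumes what is stated above: that at least one .js and one .wasm file end up in the wasm folder of the build directory.

```shell
# Hypothetical helper: verify that a directory contains at least one
# .js file and at least one .wasm file. Defaults to ./wasm.
# Returns 0 if both are present, 1 otherwise.
check_wasm_artifacts() {
  dir="${1:-wasm}"
  rc=0
  for ext in js wasm; do
    # ls exits non-zero when the glob matches no file
    if ls "$dir"/*."$ext" >/dev/null 2>&1; then
      echo "ok: .$ext artifact present in $dir"
    else
      echo "missing: no .$ext file in $dir" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (run inside build-wasm):
#   check_wasm_artifacts wasm
```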

Recompiling

As long as you don't update any submodule, just follow Compile steps 2 and 3 above to recompile.
If you update a submodule, execute the following command before those steps:

git submodule update --init --recursive

How to use

Using Native version

The build generates a library that can be integrated into any project. All the public header files are in the src folder.
A short example of how to use the APIs is provided in app/main.cpp.

Using WASM version

Please follow the README inside the wasm folder of this repository that demonstrates how to use the translator in JavaScript.
