* Rework WASM compilation options
Necessary to work with newer versions of emscripten that are more picky about which options go to the compiler and which to the linker. Also took the opportunity to remove the need for patching the bergamot-translation-worker.js file; this can now easily be done through supported APIs. Furthermore, I tried to downsize the generated JavaScript and wasm code a bit.
Initial estimates show that bergamot-translator compiled with emscripten 3.0.0 runs at about 3x the speed of 2.0.9 (when using embedded intgemm). Speed-up when using mozIntGemm is less dramatic.
* Updated marian-dev submodule
* Revert changes specific to patching external gemm modules for wasm
* Better Compilation and Link flags
- Added "-O3" optimization flag for linking as well
- "-g2" only for release and debug builds
- "-g1" for release builds
- Replaced deprecated "--bind" flag with "-lembind"
- Removed redundant link flag
* Upgraded emsdk to 3.1.8
* Enclosed EXPORTED_FUNCTIONS values in a list
* Fixed the remaining 2.0.9 reference in the CircleCI build script
* Updated README
Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
Imports python bindings and associated sources incubated in
https://github.com/jerinphilip/lemonade to bergamot-translator. Adds
a pybind11 dependency for python bindings.
Following the import, the python build is integrated into the existing
CMake based build system here. There is a command-line application
provided through python which provides the ability to fetch and prepare
models from model-repositories (like browsermt/students or OPUS).
Wheels built for a few common operating systems are provided via GitHub
releases through automated actions configured to run at tagged semantic
versions and pushes to main.
The documentation for python is also integrated into our existing
documentation setup. The previous documentation GitHub action is now
configured to run after the python build on Ubuntu 18.04 with Python 3.7,
so that it picks up the bergamot module packaged as a wheel and builds
the sphinx documentation using that python module.
Formatting checks with black and isort (profile black) and the pytype
type checker are configured for the python component residing in this
repository.
* Updated marian-dev submodule
* Import wasm gemm from a separate wasm module
- The fallback implementation of gemm is currently being imported dynamically
for wasm target
* Updated CI scripts and README to import GEMM from a separate wasm module
* Setting model config to int8shiftAlphaAll in wasm test page
Service now allows loading the Sentence-Splitter non-breaking prefix file from a ByteArray. Behaviour is consistent with the other ByteArray loads (model, shortlist): the ByteArray is used if it is non-empty; otherwise loading falls back to the file path.
Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided.
Bonus refactoring to remove an extra layer that existed for no reason.
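The ByteArray-first loading described above can be sketched as follows. This is a minimal illustration in Python, not the C++ code in the Service; the function name `load_resource` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def load_resource(byte_array: bytes, file_path: Optional[str]) -> bytes:
    """Return resource bytes, preferring the in-memory ByteArray.

    Hypothetical helper: mirrors the "use the ByteArray if non-empty,
    otherwise fall back to the file path" rule from the commit message.
    """
    if byte_array:
        # A non-empty ByteArray takes priority over the file path.
        return bytes(byte_array)
    if file_path is None:
        raise ValueError("neither a ByteArray nor a file path was supplied")
    # Empty ByteArray: fall back to reading the file from disk.
    return Path(file_path).read_bytes()
```

The same rule would apply to the model and shortlist loads, so one helper of this shape could serve all three resources.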
* Partial test applications
Previously service-cli was used to generate output and accomplish
regression testing for all of: (1) translated-text (2) alignment tokens
+ scores (3) quality scores (4) indirectly annotation and tokenizations.
The --mode native now only outputs the translated text, faithful to
the source, for the input read from stdin.
Test apps are separated into testing only individual functionalities.
This can help in independently testing ssplit-cpp, quality-scores for
the quality estimation implementation etc.
Separating numbers and text has the advantage that each can be compared
with its own tolerance: BLEU for text and allowed error rates for
numbers.
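As an illustration of comparing the numeric outputs with a tolerance, here is a minimal Python sketch. The function name `scores_match` and the default tolerance are assumptions made for this example; the regression tests may use a different threshold or metric.

```python
import math
from typing import Sequence


def scores_match(expected: Sequence[float],
                 actual: Sequence[float],
                 abs_tol: float = 1e-4) -> bool:
    """Check two score sequences element-wise within an absolute tolerance.

    Hypothetical helper; abs_tol is an assumed value, not one taken
    from the test suite.
    """
    if len(expected) != len(actual):
        # A different number of scores is always a mismatch.
        return False
    return all(math.isclose(e, a, abs_tol=abs_tol)
               for e, a in zip(expected, actual))
```

Text outputs would be compared separately, for example with a BLEU threshold, which is why keeping the two kinds of output in separate test apps is convenient.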
* Removing #mac tag
* Moving test apps to src/tests
* Tests are always on for CI
Unit tests are turned off when WASM_COMPATIBLE_SOURCES is set.
* Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE
* Workaround for now; CMakeLists.txt horrors are starting to bite
* BRT: use bergamot-test instead of bergamot now
* This should fix issues: CMakeLists.txt has so many paths
* Casing to camelCase and removing legacyServiceCli
* removing leftover service-cli declaration, some doc updates
* #pragma once is starting to look easier
* All the more reasons to do #pragma once
* Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID
* BRT: Use --gemm-highest-arch instead of python script
* Adding intgemm resolve here, where always(?) have intgemm on?
* intgemm-resolve in default binary directory
* BRT: Update to use intgemm-resolve
* marian-dev: Reset to without --gemm-highest-precision
Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>
* Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string
* marian-dev cmake fix
* Generate project.h in binary dir
* We don't want people asking about extra spaces
* Change WASM_COMPATIBLE_SOURCE=OFF by default
The default was WASM_COMPATIBLE_SOURCE=ON COMPILE_WASM=OFF, which is a
testing configuration, not a sensible default for native or wasm.
* Always USE_WASM_COMPATIBLE_SOURCE with COMPILE_WASM
* Set CMP0077 to fix variable handling
* first attempt to enable vocabs pass as byte arrays
* pass vocabs bytes as AlignedMemory
* add vocabIndices to avoid double loading
* small fix on parameter names and documentation
* fix windows build plus tiny update on documentation
* update marian-dev submodule
* move validate model bytearray in BatchTranslator
* small refactors on validateBinaryModel()
* switch vocab memories to std::vector<marian::Ptr<AlignedMemory>>
* update marian-dev submodule
* replace marian::Ptr to std::shared_ptr for vocab memories
* add note for vocab memories
* Update marian-dev to the newest mac version
* Attempt windows workflow
* force workflow rerun
* Separate id
* Attempt 3 at github action
* Marian dev submodule now compiles with apple clang
* Updated ssplit version to something more recent
* Attempt to fix compile on wasm
* Do not compile subproject tests
* Fix emscripten compilation on Mac
* 99% on the way to windows compile
* Try with a different generator
* Build release not debug
* Revert CMakeLists.txt hacks
* Fix sse2 compilation failure
* MSVC settings for WIN32
* Add nodefaultlib LIBCMT
* Do not compile ssplit.cpp as it contains sys/mman.h
* Revert ab56b9aa4f
* Update paths
* Set the build type to release if not set previously
* Attempt to build release with the windows workflow
* Attempt 5 at Visual Studio release build
* Attempt 6 at getting release build on MSVC generator
* The windows build is debug at the moment...
* fix ssplit for ubuntu 16.04
* Fix compilation with clang
* Compile on ubuntu16.04
* Explain what is going on
* Updated ssplit and workflow
* Updated marian-dev submodule
- cmake changes required after the submodule update
* Added workflows for building custom marian on mac and ubuntu
* Renamed cmake option
- Renamed USE_WASM_COMPATIBLE_SOURCES to USE_WASM_COMPATIBLE_SOURCE
- Use proper compile definitions
* Switch to wasm branch for this example
* Load marian model from a byte array
* Sanitise executable names
* Change marian branch
* Update marian branch that loads binary models
* Example of loading model as a byte array
* Add the byte array loading files
* Die on misaligned memory
* Remove the unused argument
* Allow loading without a ptr parameter so that we don't break emc workflow
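The "Die on misaligned memory" item above amounts to checking the address of the model byte array before using it. A minimal Python sketch of such a check follows; the helper names and the 64-byte default are assumptions for illustration only, and the real check lives in the C++ loading code.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """Return True if address is a multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # For a power-of-two alignment, the low bits must all be zero.
    return address & (alignment - 1) == 0


def require_aligned(address: int, alignment: int = 64) -> int:
    """Fail loudly on a misaligned address.

    Hypothetical helper; the 64-byte default is an assumed value.
    """
    if not is_aligned(address, alignment):
        raise ValueError(
            f"address {address:#x} is not aligned to {alignment} bytes")
    return address
```

Failing immediately on a misaligned buffer surfaces the problem at load time rather than as a harder-to-diagnose fault later.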
Updates marian-dev and ssplit submodules to point to the upstream
commits which implement the following:
- marian-dev: encodeWithByteRanges(...) to get source token byte-ranges
- ssplit: Has a trivial sentencesplitter functionality implemented, and
now is faster to benchmark with marian-decoder.
This enables a marian-decoder replacement written through ssplit in this
source to be benchmarked constantly with existing marian-decoder.
Nits: Removes logging introduced for multiple workers, and respective
log statements.
- Added abhi-agg/ssplit-cpp
- Added its wasm branch in bergamot-translator
- Native builds of bergamot-translator are successful
-- Sentence splitting is NOT WORKING
-- Only translation is working
Enables Mac and Ubuntu CPU only builds through GitHub CI. CI scripts are
copied from marian-dev with necessary changes.
3rd_party/marian-dev is modified to meet C++17 requirements, with
changes to half_float.
CMakeLists have been modified with the necessary includes to add
browsermt/mts@nuke files to the bergamot-translator library. In
addition, adds the ssplit dependency and the corresponding includes.
Intel MKL fails on compilation, unable to find libraries. To solve this,
3rd_party/CMakeLists.txt is modified with @ug's fixes to propagate
variables (EXT_LIBS, etc) at a library level.
Modifications to SentencePiece are necessary to provide token level
string_views. This commit changes marian to an alternate branch which
has the feature incorporated.