Commit Graph

407 Commits

Author SHA1 Message Date
Jerin Philip
05a8778497
Bump version to 0.4.5 (#427) 2022-06-21 17:49:07 +01:00
Jerin Philip
8771078177
Basic HTML property testing for WebAssembly (#425)
Import
https://gist.github.com/jelmervdl/a4c8b6b92ad88a885e1cbd51c6ad4902 and
attach it to CI.  NodeJS 14 fails when trying to use the WebAssembly
binary, so we use an independently set up Node 16.  This paves the way for
more complicated testing of WebAssembly bindings in the future.
2022-06-21 14:07:17 +01:00
Jerin Philip
61d2c35dbd
Set up python packaging for pypi distribution (#424)
The old GitHub CI jobs that explicitly used Ubuntu and macOS to build wheels
have been removed in favour of the more portable PyPA-specified builds. These
wheels should work just as well across a wider range of distributions.

pybind11:CMakeLists.txt requires Development.Module instead of
Development.* to avoid Embed from getting in the way of manylinux
builds.

manylinux_x86_64 builds are added for cp3.6 - 3.10. The linux build
uses an old image via docker.  Since the docker images are able to use a
shared ccache folder, builds are quite fast on warm starts.

ccache usage in setup.py is now triggered by an environment variable.
This allows builds not to fail if ccache is not present.

On tag pushes corresponding to versions, CI is configured to deliver
built wheels to PyPI, reading from repository secrets.

Improves setup.py including documentation and some formatting, and
additional links to source.

Fixes: #315
2022-06-20 14:35:29 +01:00
dependabot[bot]
ad781656fe
Bump 3rd_party/marian-dev from 199201e to e88c1aa (#416)
Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `199201e` to `e88c1aa`.
- [Release notes](https://github.com/browsermt/marian-dev/releases)
- [Commits](199201eb89...e88c1aa5d5)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-05-18 16:17:53 +01:00
Abhishek Aggarwal
5ae1b1ebb3
Bump version to 0.4.4 (#415) 2022-04-28 16:24:13 +02:00
Abhishek Aggarwal
e34420647d
Upgrade emsdk to 3.1.8 (#414)
* Rework WASM compilation options

Necessary to work with newer versions of emscripten that are more picky about which option goes to the compiler, and which to the linker. Also took the opportunity to remove the need for the patching of the bergamot-translation-worker.js file, this can now easily be done through supported apis. Furthermore, I tried to downsize the generated javascript and wasm code a bit.

Initial estimates show that bergamot-translator compiled with emscripten 3.0.0 runs at about 3x the speed of 2.0.9 (when using embedded intgemm). Speed-up when using mozIntGemm is less dramatic.

* Updated marian-dev submodule
* Revert changes specific to patching external gemm modules for wasm
* Better Compilation and Link flags

 - Added "-O3" optimization flag for linking as well
 - "-g2" only for release and debug builds
 - "-g1" for release builds
 - Replaced deprecated "--bind" flag with "-lembind"
 - Removed redundant link flag

* Upgraded emsdk to 3.1.8
* Enclosed EXPORTED_FUNCTIONS values in a list
* Fixed the remaining 2.0.9 reference in circle ci build script
* Updated README

Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
2022-04-20 00:39:32 +01:00
Jerin Philip
98af5945c5
Update and fix windows CI (#410)
* Use a more vanilla windows workflow from translateLocally, remove the
complicated lukka/*. Also removes port overrides in the overall upgrade.
* Disable vcpkg binary caching
* Remove PCRE library hacks after upstream ssplit improvements
2022-04-15 08:56:31 +01:00
dependabot[bot]
f18a8835fa
Bump 3rd_party/ssplit-cpp from a08d6bc to 49fde6d (#408)
Bumps [3rd_party/ssplit-cpp](https://github.com/browsermt/ssplit-cpp) from `a08d6bc` to `49fde6d`.
- [Release notes](https://github.com/browsermt/ssplit-cpp/releases)
- [Commits](a08d6bce20...49fde6df7e)

---
updated-dependencies:
- dependency-name: 3rd_party/ssplit-cpp
  dependency-type: direct:production
...
2022-04-14 11:25:51 +01:00
Jelmer
df5db52513
Fix call to isspace (#396)
Documentation is explicit about only calling it with unsigned char, and Windows runtime is checking this.
2022-03-31 12:12:33 +01:00
dependabot[bot]
7d51d109f7
Bump bergamot-translator-tests from d03a9d3 to 7984d14 (#394)
Bumps [bergamot-translator-tests](https://github.com/browsermt/bergamot-translator-tests) from `d03a9d3` to `7984d14`.
- [Release notes](https://github.com/browsermt/bergamot-translator-tests/releases)
- [Commits](d03a9d316d...7984d140ae)

---
updated-dependencies:
- dependency-name: bergamot-translator-tests
  dependency-type: direct:production
...
2022-03-30 09:41:15 +01:00
Abhishek Aggarwal
d2e3a82622
Bump version to 0.4.3 (#392) 2022-03-28 18:03:43 +02:00
Jerin Philip
13443352c0
Docs: Pin Jinja2 to last known working version (#389)
Fixes the docs workflow, which is failing after pip picks up Jinja 3.20.
We only need >=2.3; this pins it to 3.0.3, the last version for which builds were successful.
2022-03-24 19:26:20 +00:00
Jerin Philip
46882e7cfe
JS: Fix swap button on test-page (#388) 2022-03-24 15:05:45 +00:00
Jelmer
ed3160524d
JS: Update languages & use Intl API for their display names (#379)
Got the languages from registry.json, including non-prod models. 
Code now calls into `Intl.DisplayNames()`[1] to make life easier.

[1] (http://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/DisplayNames/DisplayNames)
2022-03-23 12:14:51 +00:00
dependabot[bot]
409b7d2265
Bump 3rd_party/marian-dev from 7e67124 to 844800e (#382)
Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `7e67124` to `844800e`.
- [Release notes](https://github.com/browsermt/marian-dev/releases)
- [Commits](7e67124ae0...844800efcc)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-18 11:27:53 +00:00
Abhishek Aggarwal
0a52a6d405
JS: Using supervised QE models for available language pairs (#378)
* JS: Refactored model loading
 - Passing single vocab memory via JS
* JS: Use supervised QE models when available
* Ran clang format
2022-03-15 15:55:28 +01:00
Abhishek Aggarwal
2c0e65c2ec
JS: Reuse Model registry from firefox-translation-models for test page (#377)
* JS: Reuse Model registry from firefox-translation-models repo for test page

 - https://github.com/mozilla/firefox-translations-models/blob/main/registry.json
   is reused
 - Removed existing registry
2022-03-14 18:05:22 +01:00
dependabot[bot]
22d6bc07e7
Bump 3rd_party/marian-dev from 08b1544 to 7e67124 (#372)
Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `08b1544` to `7e67124`.
- [Commits](08b1544636...7e67124ae0)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-09 08:00:28 +00:00
Abhishek Aggarwal
ab7f84f664
Bump version to 0.4.2 (#371) 2022-03-07 18:38:17 +01:00
Abhishek Aggarwal
89a96bf71e
Use right range and threshold for showing "bad" words/sentences (#370)
* Use ln(0.5) as the threshold
* Use right range for showing "bad" words/sentences
2022-03-03 17:24:32 +01:00
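A minimal sketch of the thresholding idea in the commit above, assuming quality scores are natural-log probabilities in the range (-inf, 0] and that a score strictly below ln(0.5) marks a word or sentence as "bad". The names `BAD_THRESHOLD` and `is_bad` are hypothetical and are not taken from the repository.

```python
import math

# Assumption: scores are log-probabilities, so ln(0.5) corresponds to p = 0.5.
BAD_THRESHOLD = math.log(0.5)


def is_bad(score: float) -> bool:
    """Return True when a log-probability score falls below the threshold."""
    return score < BAD_THRESHOLD
```

Under these assumptions, a word with probability 0.3 would be flagged and one with probability 0.9 would not.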
Jerin Philip
1360941ab9
Enable dependabot to automate updating dependencies (#365)
Following marian-nmt/marian-dev.
2022-03-03 11:41:26 +00:00
Jelmer
fe3f3982de
Embed quality-scores as HTML tag attributes (#358)
Quality scores for HTML translation exposed as <font
x-bergamot-sentence-score=""> and <font x-bergamot-word-score=""> tags
in the HTML output. While this increases the size of the HTML returned,
the resulting rendered HTML can easily be styled to show the scores.
With Javascript or CSS, developers can easily have some interface based
on these extra attributes.

Also includes updates to the test page to show a proof-of-concept 
demonstration.

Fixes: #355
2022-02-25 22:01:32 +00:00
Jerin Philip
96b0f82343
Simplify cache config and bind for use in JS (#359)
Deprecates cacheEnabled parameter to be replaced with cacheSize=0.
Python bindings, Documentation in comments and tests updated to reflect
this change.

Exposes the fields corresponding to cache via embind as a value object.
The equivalent object-based syntax in worker.js allows propagation
from JS.

Fixes: #351
See also: mozilla/firefox-translations#96
2022-02-23 13:25:12 +00:00
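The idea of replacing a separate on/off flag with `cacheSize=0` can be sketched as follows. This is an illustration only: `CacheConfig` and `make_cache` are hypothetical names, not the actual bindings, and a plain dictionary stands in for the real cache.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheConfig:
    """Hypothetical stand-in for the cache fields; not the real binding."""

    cache_size: int = 0  # 0 disables the cache, so no separate boolean is needed


def make_cache(config: CacheConfig) -> Optional[Dict[str, str]]:
    """Construct a cache only when a positive size is requested."""
    if config.cache_size < 0:
        raise ValueError("cache_size must be non-negative")
    return {} if config.cache_size > 0 else None
```

With a single field, a caller cannot end up in the contradictory state of an enabled cache with zero capacity.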
Jelmer
1f98f971a5
Improve handling HTML special cases (#312)
- Prefer spreading markup over a full word.
- Ignore certain tags whose contents are unlikely to be meant for translation,
  such as `<code>` and `<samp>`.
- Never treat `<wbr>` as a space.
- Allow for inconsistent cases in tag names.
- Fix bug where void elements were inserted multiple times.
- Better handling of whitespace around punctuation.
- Ignore parsing `<noscript>` to be compatible with Firefox.
- Improvements to documentation and readability of `HTML` and `Scanner`
  classes.

Fixes: #313, #339
2022-02-22 20:25:34 +00:00
Abhishek Aggarwal
9eb243725b
Bump version to 0.4.1 (#356) 2022-02-17 23:32:13 +01:00
Abhishek Aggarwal
6ccd4c68e8
Create github release via CircleCI only for mozilla fork (#349)
* Create github release via circleci only for mozilla fork

 - The extension uses mozilla fork for translator artifacts
   -- Hence create github release via circleci only when
      running in mozilla fork

* Small refactoring in ci script
2022-02-17 18:32:57 +01:00
Abhishek Aggarwal
2844cedb0d
JS: Refactoring wasm test page (#354)
* Free all the objects properly that were constructed for translation api
* Refactored pivot detection mechanism
2022-02-17 14:16:26 +01:00
Jerin Philip
9f55fb4756
Improve cache (#347)
Hide `cache-mutex-buckets` from the user. Now configured to be equal to number
of workers. Python bindings which had exposed these are modified to reflect
the API change. `std::optional` enabled on cache, constructed only if enabled.
Pointers used are replaced with an equivalent `std::optional`.

Fixes: #317
2022-02-15 11:04:07 +00:00
Kenneth Heafield
a94725b20d
Update aligned vector following intgemm 1b8cbd6f611c21011325cfe0312940f0635dea33 (#334)
Fixes memory leak
ifdef for -fno-exceptions including clang-cl
Move spacing back to intgemm upstream

Co-authored-by: Jerin Philip <jerin.philip@research.iiit.ac.in>
2022-02-14 14:26:06 +00:00
Abhishek Aggarwal
c76e630e00
JS/WASM: Passing ResponseOptions for every item for translation batch api (#348)
- Now translate() JS API accepts ResponseOptions per batch item

 - Fixed the logic to create vector<ResponseOption>
2022-02-14 13:16:33 +01:00
Jerin Philip
ec469193c6
Allow per-input options (#346)
Changes signature of BlockingService::{translate,pivot}Multiple
functions to take per input options, so a mix of HTML and plaintext
can be sent from the extension. Templating over testing is adjusted
to allow for continuous evaluations by modifying the test code.

Updates WebAssembly bindings to reflect the change in signature
and the javascript test-page to work with the new bindings.

This change lacks an accompanying test specific to the mixed HTML
and plaintext inputs.

Fixes: #345
See also: mozilla/firefox-translations#94
Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
2022-02-11 13:06:26 +00:00
Jerin Philip
34786520cd
Add ability to load .npz models (#342)
Changes `ABORT` on non `.bin` model to an additional check for a `.npz` 
extension. If `.bin`, the fast load path is activated by returning `AlignedMemory`. 
Otherwise, the return of empty `AlignedMemory` causes fallback to
filesystem-based loads.

BRT: A test that checks if translation using `.npz` is approximately similar to 
that of default CLI translation is checked in to ensure stability going ahead.

Previously, we only supported `.bin` models' loading via a fast mmap 
path. While we had the underlying capability to load non `.bin` models, this 
was not exposed, encouraging fast loads. Loading `.npz` models is helpful 
for quick debugging and broader coverage of models available, which will 
enhance user experience at translateLocally and python bindings. 


Fixes #341.
See also: XapaJIaMnu/translateLocally#89
2022-02-09 19:37:30 +00:00
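The extension check described in the commit above could look roughly like the sketch below. It assumes only what the message states: `.bin` takes the fast aligned-memory path, `.npz` falls back to filesystem-based loading, and anything else is an error. The function name and the returned labels are hypothetical.

```python
from pathlib import Path


def choose_load_path(model_path: str) -> str:
    """Pick a loading strategy from the model file extension (illustrative)."""
    suffix = Path(model_path).suffix
    if suffix == ".bin":
        return "aligned-memory"  # fast path: binary model read into aligned memory
    if suffix == ".npz":
        return "filesystem"  # fallback: regular filesystem-based load
    raise ValueError(f"unsupported model extension: {suffix!r}")
```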
Jelmer
80bd4e7651
Print errors by default in WASM build (#343)
* Remove BadHTML exception in favour of ABORT macro
   `ABORT()` gives us readable error messages, even when exception support is disabled.
* Control marian exception global setting in tests through fixture
* WASM: construct BlockingService with critical logging by default
   This log level is only used by ABORT()

See also: 
- mozilla/firefox-translations#65, 
- mozilla/firefox-translations#68
- mozilla/firefox-translations#70 
- mozilla/firefox-translations#56
2022-02-09 12:54:36 +00:00
Abhishek Aggarwal
6b2a855234
JS/WASM: Re-enable importing optimized gemm module for (#336)
- Re-enabled the code that imports optimized gemm module
   for wasm when available
2022-02-07 16:55:31 +01:00
Kenneth Heafield
f6d9233dc4
Revert "Revert "Make default throw exception on abort for python (#333)""
This reverts commit 62ff781ed4.

Sorry I should have realized Jerin was only amending python and
therefore this didn't break WASM.

Apologies to Jerin on this.
2022-02-07 14:28:31 +00:00
Kenneth Heafield
62ff781ed4
Revert "Make default throw exception on abort for python (#333)"
This reverts commit 97bd6e36db.

As discussed, we need messages for debugging in -fno-exceptions.
2022-02-05 17:26:16 +00:00
Jerin Philip
97bd6e36db
Make default throw exception on abort for python (#333)
This also allows conversion of existing aborts into runtime errors in python,
providing informative messages to the user via pybind11's existing tooling.
2022-02-05 17:25:29 +00:00
Jerin Philip
b1e5a48f1a
Increment version to v0.4.0 (#328) 2022-02-05 10:42:44 +00:00
Jerin Philip
5e78260d52
Consolidate release artefacts (#329)
Brings in the previously wasm.yml into python.yml and the new file is
renamed as build.yml.

python.yml already had a version and pre-release jobs. These jobs
downloaded artefacts from prior ran jobs (python wheel builds). The
newly attached emscripten build now uploads artefacts of a WebAssembly
binary and javascript file which are fed into the release and
pre-release jobs in addition to the existing python builds.
2022-02-04 11:54:30 +00:00
Jerin Philip
91b2e0636d
emscripten: ccache and artefact upload (#325)
Enables ccache for emscripten. The configuration uses pyodide for a
reference (https://github.com/pyodide/pyodide/pull/1805).

The two workflows running on macOS and Ubuntu are reduced to one on Ubuntu.
As emscripten and its target are cross-platform, and macOS runners are
limited, it makes sense to remove the macOS workflow.

Upload artefact enabled in preparation for a release action to be
scheduled which will upload the bergamot*.wasm and bergamot*.js for
consumption.
2022-02-02 19:21:42 +00:00
Abhishek Aggarwal
d95b014562
Wasm/JS: Pivot translation API JS binding and test page update (#327) 2022-02-02 17:01:23 +01:00
Jerin Philip
19ae519c63
Remove obsolete workflow transferring source across forks (#326) 2022-02-02 12:36:30 +00:00
Jerin Philip
95de806d1d
Fix HTML with pivoting (#323)
Previously BlockingService pivoting missed preproc and postproc for HTML
leading to issues in WebAssembly API. This change adds fixes for the
same, along with test coverage for the functionality over both async and
blocking services.
2022-02-01 13:31:11 +00:00
Jerin Philip
cfdda155e2
BRT: Update to fix QE download failures (#321) 2022-01-31 19:03:17 +00:00
Jerin Philip
c0f311a8c0
Batteries included python package (#310)
Imports python bindings and associated sources incubated in
https://github.com/jerinphilip/lemonade to bergamot-translator. Adds
 a pybind11 dependency for python bindings.

Following the import, the python build is integrated into the existing 
CMake based build system here. There is a command-line application 
provided through python which provides the ability to fetch and prepare 
models from model-repositories (like browsermt/students or OPUS).

Wheels built for a few common operating systems are provided via GitHub
releases through automated actions configured to run at tagged semantic
versions and pushes to main.

The documentation for python is also integrated into our existing
documentation setup. Previous documentation GitHub action is now
configured to run behind python builds in Ubuntu 18.04 Python3.7,
in order to pick up the bergamot module packaged as a wheel and build the
sphinx documentation using the python module.

Formatting checks of black, isort with profile black and a pytype type
checker is configured for the python component residing in this repository.
2022-01-26 20:33:43 +00:00
Jerin Philip
3dde0fe245
Remove unused compiler hash script (#309) 2022-01-24 17:36:17 +00:00
Jerin Philip
495f98dd0d
Speed up Windows CI with ccache (#308)
Use https://github.com/cristianadam/ccache/releases/ to speed up windows
compilation.

Remove /Zi as it is unsupported by ccache at the moment. This is a debug
flag that was removed in upstream marian-dev
https://github.com/browsermt/marian-dev/pull/43. However, the bergamot
CMakeLists.txt which was originally taken from
marian maintained this under MSVC.
2022-01-22 18:41:04 +00:00
Jelmer
aef76c03a4
Add API to trigger fast shutdown of AsyncService (#297)
Add a way to AsyncService to shut down without finishing the full queue
through `AsyncService::clear()`. The default behaviour is that
`AsyncService::~AsyncService()` will wait for any pending translation
requests to finish.

One can call `AsyncService::clear()` before the calls to the destructor
to ensure there is no work for the service to finish before the workers
can stop and join. Marian batches that are already in progress will not
stop. We are not trying to cause interrupts in threads or something that
complex. However, these single batches often do not take that long to
complete.

Changes:

 - Add clear() to AsyncService
 - Add clear() to BatchingPool
 - Documentation

See also:  XapaJIaMnu/translateLocally#80
2022-01-21 13:14:57 +00:00
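The shutdown behaviour described in the commit above can be illustrated with a small queue-and-worker sketch. This is not the `AsyncService` implementation: `TinyAsyncService` and its methods are hypothetical, and a single Python thread stands in for the worker pool. It shows only the stated semantics: `clear()` drops requests that have not started, while work already in progress runs to completion.

```python
import queue
import threading


class TinyAsyncService:
    """Hypothetical single-worker service; illustrates clear() before shutdown."""

    def __init__(self, work):
        self._queue = queue.Queue()
        self._work = work
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def submit(self, item):
        self._queue.put(item)

    def clear(self):
        """Drop every pending item; returns how many were dropped.

        Must be called before shutdown(), otherwise it could drop the sentinel.
        """
        dropped = 0
        while True:
            try:
                self._queue.get_nowait()
            except queue.Empty:
                return dropped
            dropped += 1

    def shutdown(self):
        """Ask the worker to stop after the queue drains, then join it."""
        self._queue.put(None)  # sentinel marking the end of the queue
        self._worker.join()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                return
            self._work(item)  # an item already in progress is never interrupted
```

Without a call to `clear()`, `shutdown()` in this sketch waits for the whole queue to be processed, mirroring the default destructor behaviour the commit describes.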
Jerin Philip
7099b9e9ad
Streamline memory-bundle loads (#307)
Provides an additional constructor which takes care of the bundle
loading inside the boundary of the source here, when a configuration
file is supplied from a client like translateLocally or python bindings.
Once the config file is read, we have access to the information required
to construct the MemoryBundle.

 - The command-line application supplied from here, app/bergamot is
   configured to use the fast-load path now.
 - Changes to binary-loading additionally revealed a bug in the
   example-run script used in docs and tied to CI and the fix is
   included.
 - Shortlist is made optional in the memory bundle, making changes to
   getModelMemoryFromConfig.

Fixes #304.
Fixes #306.
See also: XapaJIaMnu/translateLocally#82.
2022-01-19 16:36:48 +00:00
Jelmer
acbc46d816
Accept XHTML-style self-closing void tags (#305)
Allow the self-closing `/>` end for void tags. For non-void tags these
were already "allowed" due to how the HTML parser works, but for
elements where they actually occur, like `<br/>`, they caused a parse
error. Support for them was not implemented since we only expect valid
HTML5, e.g. the output of Firefox' Element.innerHTML.

Use case: TranslateLocally uses Qt's HTML representation of rich text.
That HTML uses self-closing tags like `<meta .../>` and `<br/>`.
Implementing a string replace operation that would only match these
elements without parsing HTML is tricky. Fixing it in
bergamot-translator is not.

Implementation: Currently `<img>` is marked as a void tag (an element
which cannot have children or text) and is therefore treated differently.
Since void tags normally have no close tag, they are treated as
immediately closed. The HTML parser we use reads `<img/>` as
`<img></img>`, which thus causes a problem since now we close an element
that was never open to begin with.

This fix ignores the `TT_TAG_END` token from the parser when the tag
name is that of a void tag.
2022-01-19 09:22:46 +00:00
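The fix described in the commit above can be mimicked with Python's standard `html.parser`, which by default also reports `<br/>` as a start tag followed by an end tag. The sketch below is an analogy, not the project's C++ parser: `StackChecker`, `check`, and the `VOID_TAGS` set are illustrative names. Void tags are never pushed onto the open-element stack, and an end-tag event for a void tag is ignored.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class StackChecker(HTMLParser):
    """Tracks open elements and records end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void tags are treated as immediately closed
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # ignore the synthetic end tag produced for `<br/>`
        if not self.stack or self.stack[-1] != tag:
            self.errors.append(tag)
            return
        self.stack.pop()


def check(html):
    """Return (mismatched end tags, elements left open) for an HTML fragment."""
    checker = StackChecker()
    checker.feed(html)
    checker.close()
    return checker.errors, checker.stack
```

Without the early return in `handle_endtag`, the `<br/>` in `<p>a<br/>b</p>` would be reported as closing an element that was never opened.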