432 Commits

Author SHA1 Message Date
Abhishek Aggarwal
6ccd4c68e8
Create github release via CircleCI only for mozilla fork (#349)
* Create github release via circleci only for mozilla fork

 - The extension uses mozilla fork for translator artifacts
   -- Hence create github release via circleci only when
      running in mozilla fork

* Small refactoring in ci script
2022-02-17 18:32:57 +01:00
Abhishek Aggarwal
2844cedb0d
JS: Refactoring wasm test page (#354)
* Free all the objects properly that were constructed for translation api
* Refactored pivot detection mechanism
2022-02-17 14:16:26 +01:00
Jerin Philip
9f55fb4756
Improve cache (#347)
Hide `cache-mutex-buckets` from the user. Now configured to be equal to number
of workers. Python bindings which had exposed these are modified to reflect
the API change. `std::optional` enabled on cache, constructed only if enabled.
Pointers used are replaced with an equivalent `std::optional`.

Fixes: #317
2022-02-15 11:04:07 +00:00
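
To illustrate the cache change in #347 above, here is a minimal Python sketch, not the project's C++ code. It shows a cache whose mutex-bucket count follows the worker count rather than a user option, and which is constructed only when enabled. `ShardedCache`, `make_cache`, and their parameters are hypothetical names; `None` stands in for an empty `std::optional`.

```python
import threading
from typing import Dict, List, Optional


class ShardedCache:
    """Toy cache with one lock per bucket (hypothetical, for illustration)."""

    def __init__(self, num_workers: int) -> None:
        # The bucket count is not user-configurable: it equals the worker count.
        self._locks: List[threading.Lock] = [threading.Lock() for _ in range(num_workers)]
        self._buckets: List[Dict[str, str]] = [{} for _ in range(num_workers)]

    @property
    def num_buckets(self) -> int:
        return len(self._buckets)

    def _index(self, key: str) -> int:
        return hash(key) % len(self._buckets)

    def store(self, key: str, value: str) -> None:
        i = self._index(key)
        with self._locks[i]:  # only the bucket holding this key is locked
            self._buckets[i][key] = value

    def find(self, key: str) -> Optional[str]:
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key)


def make_cache(enabled: bool, num_workers: int) -> Optional[ShardedCache]:
    """Construct the cache only if enabled; None plays the role of an empty optional."""
    return ShardedCache(num_workers) if enabled else None
```

Callers check for `None` before using the cache, much as C++ code would test the optional before dereferencing it.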
Kenneth Heafield
a94725b20d
Update aligned vector following intgemm 1b8cbd6f611c21011325cfe0312940f0635dea33 (#334)
Fixes memory leak
ifdef for -fno-exceptions including clang-cl
Move spacing back to intgemm upstream

Co-authored-by: Jerin Philip <jerin.philip@research.iiit.ac.in>
2022-02-14 14:26:06 +00:00
Abhishek Aggarwal
c76e630e00
JS/WASM: Passing ResponseOptions for every item for translation batch api (#348)
- Now translate() JS API accepts ResponseOptions per batch item

 - Fixed the logic to create vector<ResponseOption>
2022-02-14 13:16:33 +01:00
Jerin Philip
ec469193c6
Allow per-input options (#346)
Changes signature of BlockingService::{translate,pivot}Multiple
functions to take per input options, so a mix of HTML and plaintext
can be sent from the extension. Templating over testing is adjusted
to allow for continuous evaluations by modifying the test code.

Updates WebAssembly bindings to reflect the change in signature
and the javascript test-page to work with the new bindings.

This change lacks an accompanying test specific to the mixed HTML
and plaintext inputs.

Fixes: #345
See also: mozilla/firefox-translations#94
Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
2022-02-11 13:06:26 +00:00
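
The per-input options change in #346 above can be pictured with a small Python sketch. It is an assumption-laden illustration, not the `BlockingService` API: `Options`, `translate_multiple`, and the `translate_text` callback are hypothetical, and the regular expression is only a crude stand-in for real HTML handling.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Options:
    """Hypothetical per-input options; only the HTML switch is modelled."""
    html: bool = False


def translate_multiple(
    texts: List[str],
    options: List[Options],
    translate_text: Callable[[str], str],
) -> List[str]:
    """Translate each input with its own options (one Options entry per input)."""
    if len(texts) != len(options):
        raise ValueError("expected exactly one Options per input")
    outputs = []
    for text, opts in zip(texts, options):
        if opts.html:
            # Keep tags untouched and translate only the text between them.
            # re.split with a capture group puts the tags at odd indices.
            pieces = re.split(r"(<[^>]+>)", text)
            outputs.append("".join(
                piece if i % 2 == 1 else translate_text(piece)
                for i, piece in enumerate(pieces)
            ))
        else:
            # Plain text is passed through as-is, so "<" is not treated as markup.
            outputs.append(translate_text(text))
    return outputs
```

With `str.upper` standing in for a translation model, `translate_multiple(["<b>hi</b> there", "a < b"], [Options(html=True), Options()], str.upper)` returns `["<b>HI</b> THERE", "A < B"]`, which is the kind of mixed HTML and plaintext batch the commit message describes.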
Jerin Philip
34786520cd
Add ability to load .npz models (#342)
Changes `ABORT` on non `.bin` model to an additional check for a `.npz` 
extension. If `.bin`, the fast load path is activated by returning `AlignedMemory`. 
Otherwise, the return of empty `AlignedMemory` causes fallback to
filesystem-based loads.

BRT: A test is checked in that verifies translation using `.npz` is approximately
similar to the default CLI translation, to ensure stability going ahead.

Previously, we only supported loading `.bin` models via a fast mmap
path. While we had the underlying capability to load non-`.bin` models, this
was not exposed, to encourage fast loads. Loading `.npz` models is helpful
for quick debugging and broader coverage of available models, which will
enhance the user experience in translateLocally and the python bindings.


Fixes #341.
See also: XapaJIaMnu/translateLocally#89
2022-02-09 19:37:30 +00:00
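
The extension check described in #342 above amounts to a small decision, sketched here in Python rather than the project's C++. `load_path_for` and the two returned labels are hypothetical; the real code returns `AlignedMemory` (possibly empty) instead of a string.

```python
from pathlib import PurePath


def load_path_for(model_path: str) -> str:
    """Pick a loading strategy from the model file extension (illustrative only)."""
    suffix = PurePath(model_path).suffix
    if suffix == ".bin":
        # Fast path: the model bytes are handed over as aligned memory.
        return "aligned-memory"
    if suffix == ".npz":
        # Empty memory is returned, which triggers the filesystem-based load.
        return "filesystem"
    # Anything else is still rejected, as the ABORT did before.
    raise ValueError(f"unsupported model extension: {suffix!r}")
```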
Jelmer
80bd4e7651
Print errors by default in WASM build (#343)
* Remove BadHTML exception in favour of ABORT macro
   `ABORT()` gives us readable error messages, even when exception support is disabled.
* Control marian exception global setting in tests through fixture
* WASM: construct BlockingService with critical logging by default
   This log level is only used by ABORT()

See also: 
- mozilla/firefox-translations#65, 
- mozilla/firefox-translations#68
- mozilla/firefox-translations#70 
- mozilla/firefox-translations#56
2022-02-09 12:54:36 +00:00
Abhishek Aggarwal
6b2a855234
JS/WASM: Re-enable importing optimized gemm module for (#336)
- Re-enabled the code that imports optimized gemm module
   for wasm when available
2022-02-07 16:55:31 +01:00
Kenneth Heafield
f6d9233dc4
Revert "Revert "Make default throw exception on abort for python (#333)""
This reverts commit 62ff781ed4.

Sorry I should have realized Jerin was only amending python and
therefore this didn't break WASM.

Apologies to Jerin on this.
2022-02-07 14:28:31 +00:00
Kenneth Heafield
62ff781ed4
Revert "Make default throw exception on abort for python (#333)"
This reverts commit 97bd6e36db.

As discussed, we need messages for debugging in -fno-exceptions.
2022-02-05 17:26:16 +00:00
Jerin Philip
97bd6e36db
Make default throw exception on abort for python (#333)
This also allows conversion of existing aborts into runtime errors in python,
providing informative messages to the user via pybind11's existing tooling.
2022-02-05 17:25:29 +00:00
Jerin Philip
b1e5a48f1a
Increment version to v0.4.0 (#328) 2022-02-05 10:42:44 +00:00
Jerin Philip
5e78260d52
Consolidate release artefacts (#329)
Brings the previous wasm.yml into python.yml, and the new file is
renamed build.yml.

python.yml already had version and pre-release jobs. These jobs
downloaded artefacts from previously run jobs (python wheel builds). The
newly attached emscripten build now uploads artefacts of a WebAssembly
binary and javascript file, which are fed into the release and
pre-release jobs in addition to the existing python builds.
2022-02-04 11:54:30 +00:00
Jerin Philip
91b2e0636d
emscripten: ccache and artefact upload (#325)
Enables ccache for emscripten. The configuration uses pyodide for
reference (https://github.com/pyodide/pyodide/pull/1805).

The two workflows that ran on macOS and Ubuntu are reduced to one on
Ubuntu. As emscripten and its target are cross-platform, and macOS
runners are limited, it makes sense to remove the macOS one.

Upload artefact enabled in preparation for a release action to be
scheduled which will upload the bergamot*.wasm and bergamot*.js for
consumption.
2022-02-02 19:21:42 +00:00
Abhishek Aggarwal
d95b014562
Wasm/JS: Pivot translation API JS binding and test page update (#327) 2022-02-02 17:01:23 +01:00
Jerin Philip
19ae519c63
Remove obsolete workflow transferring source across forks (#326) 2022-02-02 12:36:30 +00:00
Jerin Philip
95de806d1d
Fix HTML with pivoting (#323)
Previously BlockingService pivoting missed preproc and postproc for HTML
leading to issues in WebAssembly API. This change adds fixes for the
same, along with test coverage for the functionality over both async and
blocking services.
2022-02-01 13:31:11 +00:00
Jerin Philip
cfdda155e2
BRT: Update to fix QE download failures (#321) 2022-01-31 19:03:17 +00:00
Jerin Philip
c0f311a8c0
Batteries included python package (#310)
Imports python bindings and associated sources incubated in
https://github.com/jerinphilip/lemonade to bergamot-translator. Adds
 a pybind11 dependency for python bindings.

Following the import, the python build is integrated into the existing 
CMake based build system here. There is a command-line application 
provided through python which provides the ability to fetch and prepare 
models from model-repositories (like browsermt/students or OPUS).

Wheels built for a few common operating systems are provided via GitHub
releases through automated actions configured to run at tagged semantic
versions and pushes to main.

The documentation for python is also integrated into our existing
documentation setup. The previous documentation GitHub action is now
configured to run after the python builds on Ubuntu 18.04 with Python 3.7,
in order to pick up the bergamot module packaged as a wheel and build the
sphinx documentation using the python module.

Formatting checks with black, isort with profile black, and a pytype type
checker are configured for the python component residing in this repository.
2022-01-26 20:33:43 +00:00
Jerin Philip
3dde0fe245
Remove unused compiler hash script (#309) 2022-01-24 17:36:17 +00:00
Jerin Philip
495f98dd0d
Speed up Windows CI with ccache (#308)
Use https://github.com/cristianadam/ccache/releases/ to speed up windows
compilation.

Remove /Zi as it is unsupported by ccache at the moment. This is a debug
flag that was removed in upstream marian-dev
(https://github.com/browsermt/marian-dev/pull/43). However, the bergamot
CMakeLists.txt, which was originally taken from marian, still kept it
under MSVC.
2022-01-22 18:41:04 +00:00
Jelmer
aef76c03a4
Add API to trigger fast shutdown of AsyncService (#297)
Add a way to AsyncService to shut down without finishing the full queue
through `AsyncService::clear()`. The default behaviour is that
`AsyncService::~AsyncService()` will wait for any pending translation
requests to finish.

One can call `AsyncService::clear()` before the calls to the destructor
to ensure there is no work for the service to finish before the workers
can stop and join. Marian batches that are already in progress will not
stop. We are not trying to cause interrupts in threads or something that
complex. However, these single batches often do not take that long to
complete.

Changes:

 - Add clear() to AsyncService
 - Add clear() to BatchingPool
 - Documentation

See also:  XapaJIaMnu/translateLocally#80
2022-01-21 13:14:57 +00:00
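
The fast-shutdown behaviour from #297 above can be sketched without threads. This Python toy is an assumption, not the project's `BatchingPool`: the class shape, `enqueue`, `next_batch`, and the return value of `clear()` are hypothetical. It only shows that clearing drops work that has not started, while a batch already handed out is unaffected.

```python
from collections import deque
from typing import Deque, List


class BatchingPool:
    """Toy pending-request pool (hypothetical shape, for illustration)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def enqueue(self, request: str) -> None:
        self._pending.append(request)

    def next_batch(self, max_size: int) -> List[str]:
        """Hand out up to max_size pending requests; these count as in progress."""
        batch: List[str] = []
        while self._pending and len(batch) < max_size:
            batch.append(self._pending.popleft())
        return batch

    def clear(self) -> int:
        """Drop everything still pending and report how many requests were dropped."""
        dropped = len(self._pending)
        self._pending.clear()
        return dropped
```

After `clear()`, a worker asking for the next batch gets nothing, so it can stop and join once its current batch completes.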
Jerin Philip
7099b9e9ad
Streamline memory-bundle loads (#307)
Provides an additional constructor which takes care of the bundle
loading inside the boundary of the source here, when a configuration
file is supplied from a client like translateLocally or python bindings.
Once the config file is read, we have access to the information required
to construct the MemoryBundle.

 - The command-line application supplied from here, app/bergamot is
   configured to use the fast-load path now.
 - Changes to binary-loading additionally revealed a bug in the
   example-run script used in docs and tied to CI and the fix is
   included.
 - Shortlist is made optional in the memory bundle, making changes to
   getModelMemoryFromConfig.

Fixes #304.
Fixes #306.
See also: XapaJIaMnu/translateLocally#82.
2022-01-19 16:36:48 +00:00
Jelmer
acbc46d816
Accept XHTML-style self-closing void tags (#305)
Allow the self-closing `/>` end for void tags. For non-void tags these
were already "allowed" due to how the HTML parser works, but for
elements where they actually occur, like `<br/>`, they caused a parse
error. Support for them was not implemented since we only expect valid
HTML5, e.g. the output of Firefox' Element.innerHTML.

Use case: TranslateLocally uses Qt's HTML representation of rich text.
That HTML uses self-closing tags like `<meta .../>` and `<br/>`.
Implementing a string replace operation that would only match these
elements without parsing HTML is tricky. Fixing it in
bergamot-translator is not.

Implementation: Currently `<img>` is marked as a void tag (an element
which cannot have children or text), and is therefore treated differently.
Since void tags normally have no close tag, they are treated as
immediately closed. The HTML parser we use reads `<img/>` as
`<img></img>`, which causes a problem since we now close an element
that was never open to begin with.

This fix ignores the `TT_TAG_END` token from the parser when the tag
name is that of a void tag.
2022-01-19 09:22:46 +00:00
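
The fix in #305 above can be pictured with a short Python sketch over a token stream. It is an illustration under assumptions: the token names, the `VOID_TAGS` subset, and `open_elements_after` are hypothetical and do not mirror the actual parser interface.

```python
from typing import Iterable, List, Tuple

# Illustrative subset of HTML void elements; the real list is longer.
VOID_TAGS = {"br", "img", "meta", "hr", "input"}


def open_elements_after(tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Track open elements, ignoring end-tag tokens that belong to void tags."""
    stack: List[str] = []
    for kind, name in tokens:
        if kind == "TAG_START":
            if name not in VOID_TAGS:
                stack.append(name)  # void tags are treated as immediately closed
        elif kind == "TAG_END":
            if name in VOID_TAGS:
                continue  # the parser reports <br/> as <br></br>; skip the end tag
            if not stack or stack[-1] != name:
                raise ValueError(f"closing <{name}> that is not open")
            stack.pop()
    return stack
```

Without the `continue` branch, the end-tag token produced for `<br/>` would hit the "not open" error, which matches the parse error the commit message describes.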
Jerin Philip
6a4f409cda
First class pivot translation capability (#236)
Translates a text from source-language to target-language through a
pivot-language. Effectively runs models in series, while having the
following additional benefits compared to when `Service::translate(...)`
would be used repeatedly.

1. Consistency in sentences between source and target. Consistent
creation of the alignment matrix for use in downstream tasks like
tag-translation.

2. Efficient sentence-splitting (does not sentence-split twice, creating
inconsistencies).

3. The `Response` generated can be used as if it were coming through
`translate(...)`, eliminating any need for additional code for clients
in JS or python or C++.

`AsyncService::pivot(...)` is provisioned for C++ multi-threaded setting
and `BlockingService::pivotMultiple(...)` provisioned for blocking
use-case targeted at WebAssembly.

# [BRT]: Test additions, accompanying fixes

For `AsyncService`, a test case involving en->es followed by es->en (same
vocabulary; another pair might give more coverage but is too much work):

1. Asserts the Alignment generated after pivoting is a probability
distribution over source tokens given target.

2. Outputs the sentences going from en->en, which should stay consistent
over continuous development to ensure nothing breaks.

3. An accuracy minimum of 70% of token matches from source to target
calibrated on the standard bergamot input text is additionally present,
ensuring that the English tokens at start and end match exactly.

# HTML Pipeline

This PR reworks the HTML translation pipeline to be outside
response-construction via callbacks.
2022-01-17 13:44:23 +00:00
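
The first test assertion listed for #236 above, that the pivoted alignment is a probability distribution over source tokens given a target token, can be illustrated in Python. This is one plausible way to combine two alignment matrices (a matrix product) and is an assumption, not necessarily what the implementation does; `compose_alignments` and `is_distribution` are hypothetical helpers.

```python
from typing import List

Matrix = List[List[float]]


def compose_alignments(target_to_pivot: Matrix, pivot_to_source: Matrix) -> Matrix:
    """Combine P(pivot | target) and P(source | pivot) into P(source | target)."""
    num_source = len(pivot_to_source[0])
    combined: Matrix = []
    for row in target_to_pivot:
        combined.append([
            # Marginalise over the pivot tokens.
            sum(row[p] * pivot_to_source[p][s] for p in range(len(row)))
            for s in range(num_source)
        ])
    return combined


def is_distribution(row: List[float], tol: float = 1e-9) -> bool:
    """True if the row is non-negative and sums to one within a tolerance."""
    return all(x >= 0.0 for x in row) and abs(sum(row) - 1.0) < tol
```

If every row of both inputs is a distribution, every row of the product is one as well, which is the property such a test can assert.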
Jelmer
e061b5613e
Treat most HTML elements as word-breaking (#286) 2022-01-16 10:26:40 +00:00
Jelmer
13c55e2693
Defer model loading to parallel worker thread (#303) 2022-01-14 10:30:38 +00:00
Jerin Philip
71b84b7c72
CI guaranteed example documentation (#300)
* Convert marian-integration markdown to rst
* Convert native run into a script, include in rst
* Check with CI that the native running example works without fail
2022-01-06 19:10:57 +00:00
Jelmer
dae02a3c8d
HTML transfer script/style/etc elements (#285) 2022-01-05 13:33:51 +00:00
Jerin Philip
81c21928d5
Have alignments placed if HTML is on (#296) 2022-01-03 12:27:41 +00:00
Jerin Philip
3883dd1971
cache: threadsafety-fixes; optional stats collection (#245)
* Make stats hits misses atomic to guard when mutex has multiple buckets
* Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable
* -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional
* Make a stats() call without stats enabled in the build a fatal abort
2022-01-02 12:33:30 +00:00
Jerin Philip
ddccc77570
Turn logging off by default, allow turning on via config/cmdline (#295)
* Turn logging off by default, allow turning on via config/cmdline
* No need to store config in member variable if things are decided at construction time
2022-01-02 00:17:12 +00:00
Jerin Philip
d209e4fc49
Fix typo in BRT args on CI runs (#294) 2021-12-30 16:12:30 +00:00
Jerin Philip
8eb238ed5e
HTML basic integration tests (#291) 2021-12-30 14:29:12 +00:00
Jerin Philip
6e6042c98f
GitHub CI: Update YAML to run all tests on marian-full (#292)
Previously there were #native tags and #wasm tags separating the two.
There is now a clear separation between async, blocking and wasm.
2021-12-29 11:02:56 +00:00
Abhishek Aggarwal
9e1c1e8dbf
CI: Circle CI config script update (#287)
- Robust artifact presence check
 - Variable name refactoring
 - Storing only those artifacts that are required
 - Remove commit sha from the names of the Github Releases
 - Use BERGAMOT_VERSION file contents for Git Tag names
2021-12-21 23:58:13 +01:00
Jelmer
f55377b687
HTML transfer empty elements (#283)
* Fix test case

This should now be implemented

* Remove FilterEmpty

This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack.

* Insert skipped empty source spans into target HTML

Also refactor variable names to better match their contents and be more consistent with each other.
This implementation passes all test cases, finally!

* Fix remaining style changes

* Move HTML formatting to its own section

That code had become exact copies in three different places
2021-12-21 13:44:04 +00:00
Jerin Philip
bcbbfe1295
Better command-line with isolation for both Services and co-located defaults and parsing (#252)
* CLI Rework

* Consolidate common tests, template specialize CLI

* Remove remnant cache stuff

* [BRT]: Run BRT with new cli

* Formalizing bridge

* Removing stuff from parsing and moving to TestSuite

* Template includes, everything consolidating at tests

* Inlining readFromStdin

* Removing unnecessary headers

* Checking in template implementation which was missing

* Sane defaults, some catches at BRT

* BRT: Install fixes

* Updating marian-dev to point to main

* Removing the enum indirection, using strings at one place, directly

* Fix typo;

* [BRT] test blocking service via native

* Conservative defaults for workers and cache-mutex buckets in AsyncService

* Create proper barriers for cmdline app

* Build failure fixes

* Moving common, common-impl to a familiar structure

* Binary reorganization: async, blocking, wasm

- async tests AsyncService
- blocking tests BlockingService
- wasm arranges tests for things that are Mozilla requirements. eg:
    - bytearray
    - multiple sentences in same translate request workflow.

* [brt] updates to adapt to cli rework

* [brt] updates to adapt to cli rework, all working

* Empty commit, sync brt online and run GitHub CI

* Switch for parser to have multiple mode or not

* [brt]: Fix for --bergamot-mode being removed from CLI app

* [brt]: Removing remnant faithful translation test from blocking/
2021-12-21 09:22:37 +00:00
Jelmer
1a27a8e0a7
Increase HTML test coverage (#279)
* Fix bug in HasAlignments check

When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it.

* Add more unit tests for xh_scanner

Trying to increase that code coverage to 100%

* Add test for whitespaces around attributes

* Make accessing value(), attr_name() and tag_name() at the wrong time safer

* Fix bug in <style> and <script> parsing

The end tag was never found

* Fix parsing of mix of valueless and quoteless attributes

* Sync list of void tags with Firefox' implementation of outerHTML and innerHTML

Also let's use their name for it: IsVoidTag instead of IsEmptyElement. Empty was a bit ambiguous.

* Bring back support for processing instructions support in xh_scanner

I noticed in https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 that these can be produced by innerHTML under some circumstances.

* More permanent link

* Use CamelCase for the internal functions I added

* Rename *_PI to *_PROCESSING_INSTRUCTION

Your IDE will do the typing for you anyway

* Match symbol naming of the rest of code base

CapitalCase for classes, camelCase for functions, snake_case for variables still.

* Missed one 😴

* Change xhscanner's variable case also to camelCase

* Partially fix case variables in html.cpp
2021-12-20 15:24:30 +00:00
Andre Natal
793d132b7c
Adding circle ci job to push the wasm artifacts to github releases (#280)
* Adding circle ci job to push the wasm artifacts to github releases.
* Updated config.yml
2021-12-18 00:05:11 +01:00
Abhishek Aggarwal
8884b39055
Disabled importing optimized gemm module (#282)
- Until the optimized gemm module stops requiring
   Shared Array Buffer, we can't really use it in
   Firefox
2021-12-17 17:39:43 +01:00
Jelmer
420f12b3ff
Remove value length limit from HTML parser & interpolated alignments (#274)
* Remove InterpolateAlignment

And some code improvements

* Replace the fixed value buffer with a std::string backing

* Fix tests that had no alignment info

These depended on the linear interpolation that I removed

* Remove arbitrary limits on tag and attribute names

This might also fix a bug caused by the eager lower casing of tag names, which could break <![CDATA , <style> and <script>

* Remove equals() in favour of operator==()

I trust the compiler can come up with better optimisations than I can.

* Expose std::strings instead of their data

Should save us some std::strlen() calls

* Add & remove headers and no-longer-defined functions from header files

* Remove all string buffers from xh_scanner

It now directly refers to either the input stream or constant strings

* Replace custom string_view with even lighter struct that's only used internally

To the outside world we just expose std::string_view

* Remove __builtin_sub_overflow for MSVC

* ABORT if trying to restore HTML when no alignment info is available

* Add test cases specifically for xh_scanner

Both good for testing regression, and as a little example/reference for what behaviour to expect from it.

* Add --html option to bergamot for tests

This should make it easier to have some integration tests for HTML input

* Add test and fix for empty inputs failing due to alignment check

Co-authored-by: Jerin Philip <jerinphilip@live.in>
2021-12-15 22:01:49 +00:00
Nikolay Bogoychev
8563f0856f
Proper arch setting on win32 (#275)
* Proper arch detection on win32

* Whoops
2021-12-14 23:53:53 +00:00
Abhishek Aggarwal
feb9c90429
Additional logs in JS translation worker (#277)
- Print source text received in the response
 - Print no. of block elements in the input
2021-12-14 21:52:00 +01:00
Jerin Philip
571d312930
Constrain mistune to fix docs CI (#278) 2021-12-14 16:34:30 +00:00
Abhishek Aggarwal
e75a9e1da3
More robust logic to import wasm gemm (#276)
- Import optimized gemm implementation only if all the necessary functions
   are provided by it, otherwise use the fallback gemm
2021-12-14 16:39:19 +01:00
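
The import logic from #276 above is written in JavaScript in the repository; the following Python sketch only illustrates the all-or-nothing check. `select_gemm` and the function names passed in `required` are hypothetical placeholders, not the real module's exports.

```python
from typing import Callable, Dict, Iterable, Optional

GemmModule = Dict[str, Callable[..., object]]


def select_gemm(
    optimized: Optional[GemmModule],
    fallback: GemmModule,
    required: Iterable[str],
) -> GemmModule:
    """Use the optimized module only if it provides every required function."""
    if optimized is not None and all(callable(optimized.get(name)) for name in required):
        return optimized
    # A missing module or a single missing function means the fallback is used.
    return fallback
```

Checking the whole set up front avoids a half-imported module where some calls go to the optimized implementation and others fail at run time.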
Abhishek Aggarwal
8e79897f30
Updated configuration for html text translation to work in wasm test page (#269)
* Updated translator configuration in wasm test page
 - Added alignment: soft

* Set ResponseOptions::alignment to "true"
 - Had to be set for html text translation to work
2021-12-01 11:32:51 +01:00
Abhishek Aggarwal
e8fd01e9f4
Updated marian-dev submodule 2021-11-30 17:19:42 +01:00
Jelmer
eea5554b91
HTML handling improvements (#266)
* Fix out-of-bounds error when determining alignment for whole word

If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds.

* Don't segfault if alignment info is not available

When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token.

* Partial fix for handling empty elements

This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side.

* Fix formatting
2021-11-29 08:41:24 +00:00
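
The out-of-bounds fix in #266 above concerns walking back from a continuation token to the start of its word. A minimal Python sketch of the bounded walk follows, assuming a hypothetical list of per-token continuation flags; `word_start` is not a function from the code base.

```python
from typing import List


def word_start(is_continuation: List[bool], index: int) -> int:
    """Return the index of the first token of the word containing `index`."""
    # Stop at 0 even if the first token is flagged as a continuation,
    # instead of stepping to -1 (which would wrap around for an unsigned index).
    while index > 0 and is_continuation[index]:
        index -= 1
    return index
```

The `index > 0` guard is the whole point: the first token of a sentence is always a continuation in the sense described, so an unguarded loop would step past the beginning.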