428 Commits

Author SHA1 Message Date
Jerin Philip
5a693b7eec
Fixes windows workflow for PCRE2 (#260) 2021-11-05 20:48:28 +00:00
Jerin Philip
fa4efb483b
Update ssplit cpp, pcre2 source compile to fix broken builds (#258)
* Update ssplit cpp, pcre2 source compile to fix tests

* Syncing with browsermt/ssplit-cpp

* Removing accidental binary inclusion

* Removing brt accidental update by git add -u

* Fix windows workflow, vcpkg is broken use our cmake route

* [ssplit-cpp] Try searching different library names for Windows
2021-11-05 16:46:03 +00:00
Abhishek Aggarwal
7693a1d007
Updated marian submodule (#256) 2021-11-03 13:54:48 +01:00
Jerin Philip
0bb8095bca
Deprecate hardAlignment in favour of softAlignment (#250) 2021-11-01 19:21:28 +00:00
Jerin Philip
806169c822
Recover logging (#226) 2021-11-01 16:31:01 +00:00
Abhishek Aggarwal
c5bc3f5191
Update config "skip-cost" to enable log probabilities for QE scores (#247)
- Updated wasm test page
2021-11-01 13:06:23 +01:00
Jerin Philip
9b443997e2
EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243) 2021-10-31 12:33:42 +00:00
Jerin Philip
47e57c95a6
[ssplit-cpp] Enable position independent library when compiled from sources (#240) 2021-10-29 13:40:28 +01:00
Jerin Philip
45412ce7de
Set PR to any branch to trigger workflows (#230) 2021-10-28 09:30:02 +01:00
Jerin Philip
2b98c67996
Cache for translations (#227)
Adds a cache that operates on each sentence a TranslationModel processes,
storing the corresponding marian::History under a {TranslationModel::Id, marian::Words}
key. The cache is thus shared across multiple TranslationModels and bound to the
lifetime of a Service. The cache gracefully downgrades in the case of WebAssembly.
2021-10-27 20:37:05 +01:00
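
The keying scheme described in the commit above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the repository: `ToyTranslationCache`, `CacheKey`, the `std::string` stand-in for `marian::History`, and the crude clear-when-full eviction are assumptions made only to show a cache keyed by a model id plus a word sequence and shared behind a mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for TranslationModel::Id, marian::Words and marian::History.
using ModelId = std::size_t;
using Words = std::vector<std::uint32_t>;
using History = std::string;

// The cache key pairs a model id with the sentence's word ids.
struct CacheKey {
  ModelId modelId;
  Words words;
  bool operator==(const CacheKey &other) const {
    return modelId == other.modelId && words == other.words;
  }
};

// Combines the model id hash with each word id (boost-style hash_combine).
struct CacheKeyHash {
  std::size_t operator()(const CacheKey &key) const {
    std::size_t seed = std::hash<ModelId>{}(key.modelId);
    for (std::uint32_t word : key.words) {
      seed ^= std::hash<std::uint32_t>{}(word) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// One instance can serve several models because the model id is part of the key.
class ToyTranslationCache {
 public:
  explicit ToyTranslationCache(std::size_t maxEntries) : maxEntries_(maxEntries) {
    assert(maxEntries_ > 0);
  }

  // Returns the cached history for this (model, sentence) pair, if any.
  std::optional<History> find(const CacheKey &key) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }

  // Stores a history; as a crude assumption, a full cache is simply cleared first.
  void store(const CacheKey &key, History history) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (entries_.size() >= maxEntries_ && entries_.find(key) == entries_.end()) {
      entries_.clear();
    }
    entries_[key] = std::move(history);
  }

 private:
  std::size_t maxEntries_;
  mutable std::mutex mutex_;
  std::unordered_map<CacheKey, History, CacheKeyHash> entries_;
};
```

Because the model id is part of the key, the same sentence translated by two different models occupies two separate entries.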
Abhishek Aggarwal
d0d08c0f54
JS bindings for Quality Estimation (#239)
* Quality Score bindings complete
* Updated wasm test page to test the bindings
  - Word and sentence scores can be seen in browser console
2021-10-27 19:26:55 +02:00
Abhishek Aggarwal
c5167b3d8c
Import matrix-multiply from a separate wasm module (#232)
* Updated marian-dev submodule
* Import wasm gemm from a separate wasm module
 - The fallback implementation of gemm is currently being imported dynamically
   for wasm target
* Updated CI scripts and README to import GEMM from a separate wasm module
* Setting model config to int8shiftAlphaAll in wasm test page
2021-10-27 11:54:39 +02:00
Abhishek Aggarwal
a0cb1e4b3d
Wasm test page UI for translating b/w non-English language pairs (#231)
* Updated Wasm test page UI for translating b/w non-English language pairs
* Both "from" and "to" language dropdowns now allow non-English languages
2021-10-19 14:40:54 +02:00
Abhishek Aggarwal
c7b626dfd0
Adapted wasm test page for new Service interface (#224)
- The new interface now supports running multiple TranslationModels
2021-09-28 15:53:02 +05:30
Jerin Philip
cf541c68f9
Multiple TranslationModels Implementation (#210)
For outbound translation, we require having multiple models in the
inventory at the same time and abstracting out the "how-to-translate"
using a model.

Reorganization: TranslationModel + Service. The new entity which
contains everything required to translate in one direction is
`TranslationModel`. The how-to-translate, either a blocking single-threaded
mode of operation or an async multi-threaded mode of operation, is decoupled
as `BlockingService` and `AsyncService`. A new regression test using
multiple models in conjunction is added, also serving as
a demonstration for using multiple models in Outbound Translation.

WASM: WebAssembly, due to the inability to use threads, uses
`BlockingService`.  Bindings are provided with a new API to work with a
Service, and multiple TranslationModels which the client (JS extension)
can inventory and maintain.  Ownership of a given `TranslationModel` is
shared while translations using the model are active in the internal
mechanism.

Config-Parsing: So far bergamot-translator has been hijacking marian's
config-parsing mechanisms. However, in order to support multiple models,
it has become impractical to continue this approach and a new
config-parsing that is bergamot specific is provisioned for
command-line applications constituting tests. The original marian
config-parsing tooling is only associated with a subset of
`TranslationModel` now. The new config-parsing for the library manages
workers and other common options (tentatively).

There is a known issue: inefficient placement of workspaces leads to
more memory usage than necessary. This is to be fixed in a later pull
request, trickling down from marian-dev.

This PR also brings in BRT changes which fix speed-tests that were
broken and also fixes some QE outputs which were different due to not
using shortlist.
2021-09-21 18:10:40 +01:00
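
The split between "what to translate with" and "how to translate" described above can be sketched roughly as follows. `ToyModel` and `ToyBlockingService` are hypothetical names, and the "translation" is a placeholder string operation; the sketch only assumes that a service is handed a shared pointer to a model per call, so the model stays alive while a translation is using it.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a model that translates in one direction.
struct ToyModel {
  std::string direction;
  // Placeholder "translation": tags the text with the model's direction.
  std::string translate(const std::string &source) const {
    return "[" + direction + "] " + source;
  }
};

// Hypothetical blocking, single-threaded service. It owns no model itself;
// the caller keeps an inventory and passes the model to use on each call.
class ToyBlockingService {
 public:
  std::string translate(std::shared_ptr<const ToyModel> model, const std::string &source) const {
    assert(model != nullptr);
    // The by-value shared_ptr keeps the model alive for the duration of the call,
    // even if the caller drops its own reference in the meantime.
    return model->translate(source);
  }
};
```

A client would keep several such models around and pick one per request, for example one model for each direction of an outbound translation round trip.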
Andre Barbosa
63120c174e
QualityEstimation: Preliminary Implementation (#197)
Unifies quality estimation with an interface, refactors previously available
quality scores to fit this interface. Adds a new class of model with Logistic
Regression powering the predictions as an implementation of said interface.
QE now provides annotations on words, using rule-based subword-to-word
algorithms that work with space characters.

QualityEstimation
-----------------

Implementations of QE are bound together by a `QualityEstimator`
Interface. 

1. The log-probabilities from the machine-translation model re-interpreted
   as quality scores are crafted as an implementation of QualityEstimator.

2. A Logistic-Regression based model is added. This class of models is
   trained supervised with scores labeled by a human annotator.
   Handcrafted features - number of words, log probs from MT model and 
   statistics over the sequence are used to generate the numeric features.
   LogisticRegressor, Matrix (to hold features) are added.
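
As a rough illustration of item 2, a textbook logistic-regression prediction over a handcrafted feature vector might look like the sketch below. `ToyLogisticRegressor` and its fields are hypothetical; the actual LogisticRegressor in this PR may scale features or report scores differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal regressor: one weight per feature plus an intercept.
struct ToyLogisticRegressor {
  std::vector<float> weights;
  float intercept;

  // Returns sigmoid(w . x + b), a value strictly between 0 and 1.
  float predict(const std::vector<float> &features) const {
    assert(features.size() == weights.size());
    float z = intercept;
    for (std::size_t i = 0; i < weights.size(); ++i) {
      z += weights[i] * features[i];
    }
    return 1.0f / (1.0f + std::exp(-z));
  }
};
```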

The creation of an instance is switched by the `AlignedMemory` supplied
(be it loaded from the file-system or supplied as a parameter). An empty
AlignedMemory leads to quality scores from NMT while supplying weights
of a trained logistic-regression model in binary format as the contents
lead to an additional pass through the said model to provide more
refined scores.

Both the above now transform subwords into "words" using a heuristic
algorithm, scanning for spaces. This allows the client to work with "words"
to denote quality instead of subwords, as the former is more sensible to
the user.

Testing
-------

1. BRT now has two new test apps to check the QE outputs in text
  (covers subword to words) and numbers domain (covers quality scores).
  These are tested with en-et models for which QualityEstimation is
  available now, on a new input to avoid architecture/compiler issues.
2. Unit test for LogisticRegression model is added.


Docs
----

Doxygen now supports MathJax properly, so the explanations of the
reductions used to make Logistic Regression computation more efficient
render correctly.

Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com>
Co-authored-by: Jerin Philip <jerinphilip@live.in>
2021-09-16 16:28:40 +01:00
Jerin Philip
48e955c468
BRT: Update sacrebleu to get tests back working (#217)
Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
2021-09-07 19:10:41 +01:00
Abhishek Aggarwal
8e4374282a Circle CI wasm artifacts for non-wormhole builds 2021-08-31 17:01:52 +02:00
Abhishek Aggarwal
cafb65e0b5 Wasm builds without SharedArrayBuffer 2021-08-27 09:07:06 +02:00
Abhishek Aggarwal
ff391c6f00 Updated marian submodule to latest commit of master 2021-08-27 09:07:06 +02:00
Abhishek Aggarwal
b64ffce496
Wasm test page using web workers now (#218) 2021-08-26 15:22:52 +02:00
Jerin Philip
972d8560b5
Add a clang-tidy run (#214)
Adds a clang-tidy run in addition to the existing clang-format checks.
The clang-tidy checks are not enforced, but are potentially useful to
point to during review.
2021-08-13 16:26:44 +01:00
Abhishek Aggarwal
9994d4acdb
Merge pull request #215 from abhi-agg/non-wormhole-builds
Wasm build instructions to run translations on other browsers
2021-08-11 19:19:13 +02:00
Abhishek Aggarwal
f3e00ae657 Added build instructions to run on other browsers
- Disabled compiling with wormhole which is Firefox specific feature
2021-08-11 13:28:15 +02:00
Jerin Philip
d31f96381b
Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208)
A cmake change has caused vcpkg to fail without much error message,
which is causing windows workflow runs to fail. Details in the following
link:

* https://github.com/microsoft/vcpkg/issues/18718

To fix, we're going with a version bump in vcpkg. Seeing that run-vcpkg
also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4.
Playing with fire: vcpkg master commit
2021-07-29 12:25:09 +01:00
Abhishek Aggarwal
5a8fe209ce Wasm: Enabled sentence byte ranges in the wasm test page
- Use JS bindings to print all sentences individually on
   console
2021-07-19 12:06:22 +02:00
Abhishek Aggarwal
7052722cd2 JS bindings to return sentence byte ranges 2021-07-19 12:06:22 +02:00
Abhishek Aggarwal
6ad794fcef Added public methods in Response class to return sentences
- Refactored ByteRange struct and moved it to definition.h
2021-07-19 12:06:22 +02:00
Jerin Philip
a202e350c7
Change ResponseBuilder to accept callback instead of future (#142)
* Change ResponseBuilder to accept callback

Breaks things everywhere, now we follow the compiler to fix and convert
the std::future -> callback.

* More std::future -> callback

* std::future out of service.{h,cpp}

* compile is working, so is callback

* Some reshuffling of args

* Fixing merge error

* Fixing signature conflicts out of merge

* Fixing that test duct-taping future

* Minor adjustment to get that future back

* Add documentation for the new callback function

* Applying clang-format after update

* Using default responseOptions

* Remove future references from documentation

* translateMultiple only for WASM (#177)

* BRT: update to main; fresh-failures hopefully

* Converting test translateFromStdin to use callback

* BRT: Add fresh #native and #wasm tags

* future from promise, fix error

* Adding #native to GitHub CI

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
2021-07-05 14:51:01 +01:00
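
The std::future -> callback conversion above, together with the "future from promise" step, can be sketched as follows. The names `translateWithCallback`, `translateWithFuture` and the `std::string` stand-in for a Response are hypothetical; the sketch only assumes that the core API takes a callback and that a caller who still wants a future can build one from a promise fulfilled inside that callback.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the library's Response and callback types.
using Response = std::string;
using CallbackType = std::function<void(Response &&)>;

// Callback-style API: the result is handed to the callback once it is ready.
void translateWithCallback(const std::string &source, CallbackType callback) {
  assert(callback != nullptr);
  Response response = "translated: " + source;  // placeholder for real work
  callback(std::move(response));
}

// Recovers a future-style API on top of the callback one.
std::future<Response> translateWithFuture(const std::string &source) {
  // shared_ptr keeps the promise copyable so the lambda fits in std::function.
  auto promise = std::make_shared<std::promise<Response>>();
  std::future<Response> future = promise->get_future();
  translateWithCallback(source, [promise](Response &&response) {
    promise->set_value(std::move(response));
  });
  return future;
}
```

A callback-first design fits environments without threads, where blocking on a future is not an option, while native callers can still wrap it as shown.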
Jerin Philip
cb855be1a7
maxLengthBreak_ -> wrapStep bugfix (#200) 2021-06-28 14:54:39 +01:00
Jerin Philip
13a1fe870f
Load sentence-splitter (non-breaking prefixes) from ByteArray
Service now allows loading the Sentence-Splitter (non-breaking prefix file) from a ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist): the ByteArray is used if it is non-empty; otherwise loading falls back to the file path.

Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. 

Bonus refactoring to remove an extra layer that existed for no reason.
2021-06-21 18:53:30 +01:00
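
The load order described in this commit (in-memory bytes first, file path as the fallback) can be sketched as below. `loadPrefixes` and its parameters are hypothetical and do not mirror the library's actual signatures.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the non-breaking-prefix data, preferring in-memory bytes over the file.
std::string loadPrefixes(const std::vector<char> &bytes, const std::string &path) {
  if (!bytes.empty()) {
    // Bytes were supplied by the caller (e.g. from JS), so the path is ignored.
    return std::string(bytes.begin(), bytes.end());
  }
  // No bytes supplied: fall back to reading the file at the given path.
  std::ifstream file(path, std::ios::binary);
  if (!file) {
    throw std::runtime_error("cannot open prefix file: " + path);
  }
  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}
```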
Jerin Philip
44aa70a064
Account for EOS in both source and target annotations (#190) 2021-06-15 18:59:51 +01:00
Abhishek Aggarwal
b00116cb94
Refactor wasm bindings to use consistent interface names as in native (#195)
* Refactored wasm bindings code
 - Replaced TranslationModel, TranslationRequest and TranslationResult
    with Service, ResponseOptions and Response
 - Corresponding documentation changes
 - Names of the bindings files changed
 - Moved Vector<Response> definition in Response specific bindings
   file
2021-06-15 16:02:14 +02:00
Jerin Philip
4b014665ba
Removing alignments and quality-scores test-code (#196)
* Removing alignments and quality-scores test-code
* BRT: Update to main
2021-06-14 18:40:41 +01:00
Jerin Philip
e9e5ac6782
Partial test-apps and tolerance in evaluations (#184)
* Partial test applications

Previously service-cli was used to generate output and accomplish
regression testing for all of: (1) translated-text (2) alignment tokens
+ scores (3) quality scores (4) indirectly annotation and tokenizations.

With --mode native, the output is now only the translated text, faithful
to the source, of the input read from stdin.

Test apps are separated into testing only individual functionalities.
This can help in independently testing ssplit-cpp, quality-scores for
the quality estimation implementation etc.

Separating numbers and text has the advantage of allowing each to be
compared with tolerance: BLEU for text and some allowed error rates for
numbers.

* Removing #mac tag

* Moving test apps to src/tests

* Tests are always on for CI

Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES.

* Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE

* Workaround for now; CMakeLists.txt horrors are starting to bite

* BRT: use bergamot-test instead of bergamot now

* This should fix issues: CMakeLists.txt has so many paths

* Casing to camelCase and removing legacyServiceCli

* removing leftover service-cli declaration, some doc updates

* #pragma once is starting to look easier

* All the more reasons to do #pragma once

* Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID

* BRT: Use --gemm-highest-arch instead of python script

* Adding intgemm resolve here, where always(?) have intgemm on?

* intgemm-resolve in default binary directory

* BRT: Update to use intgemm-resolve

* marian-dev: Reset to without --gemm-highest-precision

Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>
2021-06-14 15:02:42 +01:00
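
The "allowed error rates" idea for numeric outputs in the commit above can be sketched as an element-wise comparison with an absolute tolerance. The function name and the choice of an absolute (rather than relative) tolerance are assumptions for illustration; the regression tests themselves may compare differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when both sequences have the same length and every pair of values
// differs by at most `tolerance`.
bool withinTolerance(const std::vector<double> &expected, const std::vector<double> &actual,
                     double tolerance) {
  assert(tolerance >= 0.0);
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (std::fabs(expected[i] - actual[i]) > tolerance) return false;
  }
  return true;
}
```

Comparing with a tolerance keeps the tests stable when scores differ slightly across architectures or compilers.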
Abhishek Aggarwal
16eb47f47e
Generating cmake configured project version (.js) file in build folder (#194)
- Earlier this file was being generated in folder containing
   actual sources

 - Fixes https://github.com/browsermt/bergamot-translator/issues/161
2021-06-09 13:57:23 +01:00
Jerin Philip
3039dea34b
Fixing if syntax with YAML var substitution (#188) 2021-06-09 10:21:23 +01:00
Jerin Philip
dc2fb3d64e
CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193)
* Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string

* marian-dev cmake fix

* Generate project.h in binary dir

* We don't want people asking about extra spaces
2021-06-09 10:12:00 +01:00
Abhishek Aggarwal
3e46e3391c Consistent EMSDK version and parallel make jobs in README and github actions
- Set EMSDK version to 2.0.9 to make it consistent
   everywhere in repo
 - Set parallel make jobs to 2
2021-06-09 11:10:10 +02:00
Jerin Philip
d39e0277c6
Replace resize with possible negative range with pop_back() (#189) 2021-06-05 00:28:53 +01:00
Jerin Philip
71a62405e7
Update native (ubuntu, mac) workflows with ccache (#181)
* Matrix is now more organized; Ubuntu 20.04-gcc9.3 and Ubuntu-18.04-gcc7.5 are added.
* ccache is extended to MacOS, and brings down CI run times to <5m when
  ccache works.
* The compiler hash scripts are gone, since ccache already covers most ground
  by default and the shell script is unnecessary. The cache works by hashing
  the preprocessor output obtained by running the compiler with -E, which
  includes the necessary information. See "How the cache works" in the ccache docs.
* If BRT fails, it prints the final 20 lines of the test*.log so that failures
  can be inspected without having to download artifacts.
* Pull request on any branch triggers workflow.
* Push on main and ci-sandbox triggers workflow.
2021-06-04 11:52:36 +01:00
Kenneth Heafield
5f0d3963e2
Remove addSentenceWithPriority (#186) 2021-06-04 00:09:20 +01:00
Jerin Philip
73228bbb4a
Updating marian-dev: intgemm with env variable matmul switches (#187) 2021-06-03 21:01:26 +01:00
Jerin Philip
330840338c
Including WASM documentation in sphinx build toc (#176) 2021-06-01 12:39:28 +01:00
Jerin Philip
ceaf21a532
Deploy generated documentation only if browsermt (#179) 2021-06-01 11:00:53 +01:00
Jerin Philip
5d3ec9c0a9
Single executable (#175)
* Collapsing executables

* Adding new test executable

* Deleting old executable sources

* Updating brt to operate with modes

* cli-framework -> cli

* Updating workflows to check for bergamot instead of bergamot-translator-app

* Adding documentation

* Making fn pure virtual

* Shuffling apps into app namespace, alongside class documentation

* Include app folder in documentation

* BRT update service-cli -> native

* parser.h: service-cli -> native

* Updates to marian-integration.md

* Cleanup: Remove templates, interface proper

* change 4 to 2 cores for build instructions

* service-cli -> native

* Commenting the string constructor explanation

* Not doing halfway interface / inheritance

* Nick hates state, let's try this one

* Revert "Nick hates state, let's try this one"

This reverts commit e56db9f474.

* class -> struct before trying std::function stuff

* oop -> functional?

* Hints on what is happening

* app::ftable -> app::REGISTRY

* We have if-else and functions now.

And we won't have test apps.

* Doc linking to usage examples in brt

* Remove unordered_map

* Documentation updates

* Fix warning
2021-05-31 14:44:59 +01:00
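
The end state described in the commit above ("We have if-else and functions now") can be sketched as a single executable that dispatches on a mode string. The `native` mode name appears in this log; the `decoder` mode, the function bodies and `runMode` itself are hypothetical placeholders.

```cpp
#include <cassert>
#include <string>

namespace app {
// Placeholder app functions; real ones would run a translation workflow.
std::string native() { return "ran native"; }
std::string decoder() { return "ran decoder"; }  // hypothetical second mode
}  // namespace app

// Plain if-else dispatch on the mode, with no registry or map involved.
std::string runMode(const std::string &mode) {
  assert(!mode.empty());
  if (mode == "native") {
    return app::native();
  } else if (mode == "decoder") {
    return app::decoder();
  }
  return "unknown mode: " + mode;
}
```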
Jerin Philip
eb579ed26f
Updating marian dev RelwithDebInfo -> Release (#178)
* Updating marian dev RelwithDebInfo -> Release

* Updating submodule to point to master
2021-05-27 10:51:53 +01:00
Qianqian Zhu
8bec1b7b6b
Fix failures when loading text shortlist (#154) 2021-05-25 12:05:16 +01:00
Jerin Philip
576afae6b3
Adding documentation action (#168)
Adds a GitHub workflow that builds documentation from sources using doxygen and sphinx, on push to the main branch or on push of any semantic version tag. The built documentation is deployed at https://github.com/browsermt/docs@gh-pages, which is rendered at https://browser.mt/docs/<suffix>, where <suffix> is 'main' or a tag vM.m.p corresponding to a semantic version.

On pull requests, artifacts are uploaded for reviewers to inspect if need be.
2021-05-25 11:10:56 +01:00
Jerin Philip
22a1b9113e
Remove O(N^2) reallocation (#171) 2021-05-22 00:04:49 +01:00