bergamot-translator

mirror of https://github.com/browsermt/bergamot-translator.git synced 2024-09-11 13:46:12 +03:00

Author	SHA1	Message	Date
Abhishek Aggarwal	8e79897f30	Updated configuration for html text translation to work in wasm test page (#269 ) * Updated translator configuration in wasm test page - Added alignment: soft * Set ResponseOptions::alignment to "true" - Had to be set for html text translation to work	2021-12-01 11:32:51 +01:00
Abhishek Aggarwal	e8fd01e9f4	Updated marian-dev submodule	2021-11-30 17:19:42 +01:00
Jelmer	eea5554b91	HTML handling improvements (#266 ) * Fix out-of-bounds error when determining alignment for whole word If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds. * Don't segfault if alignment info is not available When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token. * Partial fix for handling empty elements This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side. * Fix formatting	2021-11-29 08:41:24 +00:00
Kenneth Heafield	40366162d8	HTML input (#253 ) Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl> Co-authored-by: Abhishek Aggarwal <aaggarwal@mozilla.com>	2021-11-25 13:57:50 +00:00
Abhishek Aggarwal	2b1b0531ff	Import optimized gemm implementation (when available) for wasm target (#265 ) * Enable importing optimized gemm module for wasm - Updated emscripten generated JS code to -- import and use the optimized gemm module when available, otherwise use fallback gemm implementation * Added logging for gemm implementation being used for wasm target	2021-11-17 09:18:55 +01:00
Abhishek Aggarwal	f9e55b3cd8	Make script run from any directory (#262 ) * Make script run from any directory	2021-11-15 22:30:52 +01:00
Andre Natal	d6a14b1d6f	Fix badge to point to this repo instead mozilla's (#261 )	2021-11-15 08:14:21 +00:00
Jerin Philip	5a693b7eec	Fixes windows workflow for PCRE2 (#260 )	2021-11-05 20:48:28 +00:00
Jerin Philip	fa4efb483b	Update ssplit cpp, pcre2 source compile to fix broken builds (#258 ) * Update ssplit cpp, pcre2 source compile to fix tests * Syncing with browsermt/ssplit-cpp * Removing accidental binary inclusion * Removing brt accidental update by git add -u * Fix windows workflow, vcpkg is broken use our cmake route * [ssplit-cpp] Try searching different library names for Windows	2021-11-05 16:46:03 +00:00
Abhishek Aggarwal	7693a1d007	Updated marian submodule (#256 )	2021-11-03 13:54:48 +01:00
Jerin Philip	0bb8095bca	Deprecate hardAlignment in favour of softAlignment (#250 )	2021-11-01 19:21:28 +00:00
Jerin Philip	806169c822	Recover logging (#226 )	2021-11-01 16:31:01 +00:00
Abhishek Aggarwal	c5bc3f5191	Update config "skip-cost" to enable log probabilities for QE scores (#247 ) - Updated wasm test page	2021-11-01 13:06:23 +01:00
Jerin Philip	9b443997e2	EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243 )	2021-10-31 12:33:42 +00:00
Jerin Philip	47e57c95a6	[ssplit-cpp] Enable position independent library when compiled from sources (#240 )	2021-10-29 13:40:28 +01:00
Jerin Philip	45412ce7de	Set PR to any branch to trigger workflows (#230 )	2021-10-28 09:30:02 +01:00
Jerin Philip	2b98c67996	Cache for translations (#227 ) Sets a cache to operate for each sentence that a TranslationModel processes, caching the corresponding marian::History for a {TranslationModel::Id, marian::Words} key. Cache is thus shared across multiple TranslationModels bound to the lifetime of a Service. Cache gracefully downgrades in the case of WebAssembly.	2021-10-27 20:37:05 +01:00
Abhishek Aggarwal	d0d08c0f54	JS bindings for Quality Estimation (#239 ) * Quality Score bindings complete * Updated wasm test page to test the bindings - Word and sentence scores can be seen in browser console	2021-10-27 19:26:55 +02:00
Abhishek Aggarwal	c5167b3d8c	Import matrix-multiply from a separate wasm module (#232 ) * Updated marian-dev submodule * Import wasm gemm from a separate wasm module - The fallback implementation of gemm is currently being imported dynamically for wasm target * Updated CI scripts and README to import GEMM from a separate wasm module * Setting model config to int8shiftAlphaAll in wasm test page	2021-10-27 11:54:39 +02:00
Abhishek Aggarwal	a0cb1e4b3d	Wasm test page UI for translating b/w non-English language pairs (#231 ) * Updated Wasm test page UI for translating b/w non-English language pairs * Both "from" and "to" language dropdowns now allow non-English languages	2021-10-19 14:40:54 +02:00
Abhishek Aggarwal	c7b626dfd0	Adapted wasm test page for new Service interface (#224 ) - The new interface now supports running multiple TranslationModels	2021-09-28 15:53:02 +05:30
Jerin Philip	cf541c68f9	Multiple TranslationModels Implementation (#210 ) For outbound translation, we require having multiple models in the inventory at the same time and abstracting the "how-to-translate" using a model out. Reorganization: TranslationModel + Service. The new entity which contains everything required to translate in one direction is `TranslationModel`. The how-to-translate blocking single-threaded mode of operation or async multi-threaded mode of operation is decoupled as `BlockingService` and `AsyncService`. There is a new regression-test using multiple models in conjunction added, also serving as a demonstration for using multiple models in Outbound Translation. WASM: WebAssembly due to the inability to use threads uses `BlockingService`. Bindings are provided with a new API to work with a Service, and multiple TranslationModels which the client (JS extension) can inventory and maintain. Ownership of a given `TranslationModel` is shared while translations using the model are active in the internal mechanism. Config-Parsing: So far bergamot-translator has been hijacking marian's config-parsing mechanisms. However, in order to support multiple models, it has become impractical to continue this approach and a new config-parsing that is bergamot specific is provisioned for command-line applications constituting tests. The original marian config-parsing tooling is only associated with a subset of `TranslationModel` now. The new config-parsing for the library manages workers and other common options (tentatively). There is a known issue of: Inefficient placing of workspaces, leading to more memory usage than what's necessary. This is to be fixed trickling down from marian-dev in a later pull request. This PR also brings in BRT changes which fix speed-tests that were broken and also fixes some QE outputs which were different due to not using shortlist.	2021-09-21 18:10:40 +01:00
Andre Barbosa	63120c174e	QualityEstimation: Preliminary Implementation (#197 ) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com> Co-authored-by: Jerin Philip <jerinphilip@live.in>	2021-09-16 16:28:40 +01:00
Jerin Philip	48e955c468	BRT: Update sacrebleu to get tests back working (#217 ) Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>	2021-09-07 19:10:41 +01:00
Abhishek Aggarwal	8e4374282a	Circle CI wasm artifacts for non-wormhole builds	2021-08-31 17:01:52 +02:00
Abhishek Aggarwal	cafb65e0b5	Wasm builds without SharedArrayBuffer	2021-08-27 09:07:06 +02:00
Abhishek Aggarwal	ff391c6f00	Updated marian submodule to latest commit of master	2021-08-27 09:07:06 +02:00
Abhishek Aggarwal	b64ffce496	Wasm test page using web workers now (#218 )	2021-08-26 15:22:52 +02:00
Jerin Philip	972d8560b5	Add a clang-tidy run (#214 ) Adds a clang-tidy run in addition to the existing clang-format checks. The clang-tidy checks are not enforced, but are potentially useful to point to during review.	2021-08-13 16:26:44 +01:00
Abhishek Aggarwal	9994d4acdb	Merge pull request #215 from abhi-agg/non-wormhole-builds Wasm build instructions to run translations on other browsers	2021-08-11 19:19:13 +02:00
Abhishek Aggarwal	f3e00ae657	Added build instructions to run on other browsers - Disabled compiling with wormhole which is Firefox specific feature	2021-08-11 13:28:15 +02:00
Jerin Philip	d31f96381b	Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208 ) A cmake change has caused vcpkg to fail without much error message, which is causing windows workflow runs to fail. Details in the following link: * https://github.com/microsoft/vcpkg/issues/18718 To fix, we're going with a version bump in vcpkg. Seeing that run-vcpkg also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4 Playing with fire: vcpkg master commit	2021-07-29 12:25:09 +01:00
Abhishek Aggarwal	5a8fe209ce	Wasm: Enabled sentence byte ranges in the wasm test page - Use JS bindings to print all sentences individually on console	2021-07-19 12:06:22 +02:00
Abhishek Aggarwal	7052722cd2	JS bindings to return sentence byte ranges	2021-07-19 12:06:22 +02:00
Abhishek Aggarwal	6ad794fcef	Added public methods in Response class to return sentences - Refactored ByteRange struct and moved it to definition.h	2021-07-19 12:06:22 +02:00
Jerin Philip	a202e350c7	Change ResponseBuilder to accept callback instead of future (#142 ) * Change ResponseBuilder to accept callback Breaks things everywhere, now we follow the compiler to fix and convert the std::future -> callback. * More std::future -> callback * std::future out of service.{h,cpp} * compile is working, so is callback * Some reshuffling of args * Fixing merge error * Fixing signature conflicts out of merge * Fixing that test duct-taping future * Minor adjustment to get that future back * Add documentation for the new callback function * Applying clang-format after update * Using default responseOptions * Remove future references from documentation * translateMultiple only for WASM (#177) * BRT: update to main; fresh-failures hopefully * Converting test translateFromStdin to use callback * BRT: Add fresh #native and #wasm tags * future from promise, fix error * Adding #native to GitHub CI Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>	2021-07-05 14:51:01 +01:00
Jerin Philip	cb855be1a7	maxLengthBreak_ -> wrapStep bugfix (#200 )	2021-06-28 14:54:39 +01:00
Jerin Philip	13a1fe870f	Load sentence-splitter (non-breaking prefixes) from ByteArray Service now allows loading Sentence-Splitter (non-breaking prefix file) from ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist), where first the ByteArray is checked if empty, if not fall back to loading from file-path. Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. Bonus refactoring to remove an extra layer that existed for no reason.	2021-06-21 18:53:30 +01:00
Jerin Philip	44aa70a064	Account for EOS in both source and target annotations (#190 )	2021-06-15 18:59:51 +01:00
Abhishek Aggarwal	b00116cb94	Refactor wasm bindings to use consistent interface names as in native (#195 ) * Refactored wasm bindings code - Replaced TranslationModel, TranslationRequest and TranslationResult with Service, ResponseOptions and Response - Corresponding documentation changes - Names of the bindings files changed - Moved Vector<Response> definition in Response specific bindings file	2021-06-15 16:02:14 +02:00
Jerin Philip	4b014665ba	Removing alignments and quality-scores test-code (#196 ) * Removing alignments and quality-scores test-code * BRT: Update to main	2021-06-14 18:40:41 +01:00
Jerin Philip	e9e5ac6782	Partial test-apps and tolerance in evaluations (#184 ) * Partial test applications Previously service-cli was used to generate output and accomplish regression testing for all of: (1) translated-text (2) alignment tokens + scores (3) quality scores (4) indirectly annotation and tokenizations. The --mode native now only outputs a faithful to source translated text of the input source on stdin. Test apps are separated into testing only individual functionalities. This can help in independently testing ssplit-cpp, quality-scores for the quality estimation implementation etc. Separating numbers and text have the advantage of being able to compare one with tolerance using BLEU (text) and some allowed error-rates (numbers). * Removing #mac tag * Moving test apps to src/tests * Tests are always on for CI Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES. * Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE * Workaround for now; CMakeLists.txt horrors are starting to bite * BRT: use bergamot-test instead of bergamot now * This should fix issues: CMakeLists.txt has so many paths * Casing to camelCase and removing legacyServiceCli * removing leftover service-cli declaration, some doc updates * #pragma once is starting to look easier * All the more reasons to do #pragma once * Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID * BRT: Use --gemm-highest-arch instead of python script * Adding intgemm resolve here, where always(?) have intgemm on? * intgemm-resolve in default binary directory * BRT: Update to use intgemm-resolve * marian-dev: Reset to without --gemm-highest-precision Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>	2021-06-14 15:02:42 +01:00
Abhishek Aggarwal	16eb47f47e	Generating cmake configured project version (.js) file in build folder (#194 ) - Earlier this file was being generated in folder containing actual sources - Fixes https://github.com/browsermt/bergamot-translator/issues/161	2021-06-09 13:57:23 +01:00
Jerin Philip	3039dea34b	Fixing if syntax with YAML var substitution (#188 )	2021-06-09 10:21:23 +01:00
Jerin Philip	dc2fb3d64e	CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193 ) * Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string * marian-dev cmake fix * Generate project.h in binary dir * We don't want people asking about extra spaces	2021-06-09 10:12:00 +01:00
Abhishek Aggarwal	3e46e3391c	Consistent EMSDK version and parallel make jobs in README and github actions - Set EMSDK version to 2.0.9 to make it consistent everywhere in repo - Set parallel make jobs to 2	2021-06-09 11:10:10 +02:00
Jerin Philip	d39e0277c6	Replace resize with possible negative range with pop_back() (#189 )	2021-06-05 00:28:53 +01:00
Jerin Philip	71a62405e7	Update native (ubuntu, mac) workflows with ccache (#181 ) * Matrix is now more organized, Ubuntu 20.04-gcc9.3, Ubuntu-18.04-gcc7.5 is added. * ccache is extended to MacOS, and brings down CI run times to <5m when ccache works. * The compiler hash scripts are gone, ccache already covers most ground by default. The shell script is unnecessary. Cache works by preprocessor mode output of running the compiler with -E, which includes the necessary information. ccache-docs:How the cache works. * BRT if failed prints the final 20 lines of the test.log to inspect what's going wrong without having to artifact download. Pull request on any branch triggers workflow. * Push on main and ci-sandbox triggers workflow.	2021-06-04 11:52:36 +01:00
Kenneth Heafield	5f0d3963e2	Remove addSentenceWithPriority (#186 )	2021-06-04 00:09:20 +01:00
Jerin Philip	73228bbb4a	Updating marian-dev: intgemm with env variable matmul switches (#187 )	2021-06-03 21:01:26 +01:00


435 Commits