bergamot-translator

mirror of https://github.com/browsermt/bergamot-translator.git synced 2024-09-11 05:35:33 +03:00

Author	SHA1	Message	Date
Jelmer	acbc46d816	Accept XHTML-style self-closing void tags (#305 ) Allow the self-closing `/>` end for void tags. For non-void tags these were already "allowed" due to how the HTML parser works, but for elements where they actually occur, like `<br/>`, they caused a parse error. Support for them was not implemented since we only expect valid HTML5, e.g. the output of Firefox' Element.innerHTML. Use case: TranslateLocally uses Qt's HTML representation of rich text. That HTML uses self-closing tags like `<meta .../>` and `<br/>`. Implementing a string replace operation that would only match these elements without parsing HTML is tricky. Fixing it in bergamot-translator is not. Implementation: Currently `<img>` is marked as a void tag (an element which cannot have children or text), and is therefore treated differently. Since void tags normally have no close tag, they are treated as immediately closed. The HTML parser we use reads `<img/>` as `<img></img>`, which causes a problem since we now close an element that was never open to begin with. This fix ignores the `TT_TAG_END` token from the parser when the tag name is that of a void tag.	2022-01-19 09:22:46 +00:00
Jerin Philip	6a4f409cda	First class pivot translation capability (#236 ) Translates a text from source-language to target-language through a pivot-language. Effectively runs models in series, while having the following additional benefits compared to when `Service::translate(...)` would be used repeatedly. 1. Consistency in sentences between source and target. Consistent creation of the alignment matrix for use in downstream tasks like tag-translation. 2. Efficient sentence-splitting (does not sentence-split twice, creating inconsistencies). 3. The `Response` generated can be used as if it were coming through `translate(...)`, eliminating any need for additional code for clients in JS or python or C++. `AsyncService::pivot(...)` is provisioned for C++ multi-threaded setting and `BlockingService::pivotMultiple(...)` provisioned for blocking use-case targeted at WebAssembly. # [BRT]: Test additions, accompanying fixes For `AsyncService` for a test-case involving en->es, es->en (same vocabulary, another one might be more coverage but is too much work). 1. Asserts the Alignment generated after pivoting is a probability distribution over source tokens given target. 2. Outputs the sentences going from en->en, which should stay consistent over continuous development to ensure nothing breaks. 3. An accuracy minimum of 70% of token matches from source to target calibrated on the standard bergamot input text is additionally present, ensuring that the English tokens at start and end match exactly. # HTML Pipeline This PR reworks the HTML translation pipeline to be outside response-construction via callbacks.	2022-01-17 13:44:23 +00:00
Jelmer	e061b5613e	Treat most HTML elements as word-breaking (#286 )	2022-01-16 10:26:40 +00:00
Jelmer	13c55e2693	Defer model loading to parallel worker thread (#303 )	2022-01-14 10:30:38 +00:00
Jerin Philip	71b84b7c72	CI guaranteed example documentation (#300 ) * Convert marian-integration markdown to rst * Convert native run into a script, include in rst * Check with CI that the native running example works without fail	2022-01-06 19:10:57 +00:00
Jelmer	dae02a3c8d	HTML transfer script/style/etc elements (#285 )	2022-01-05 13:33:51 +00:00
Jerin Philip	81c21928d5	Have alignments placed if HTML is on (#296 )	2022-01-03 12:27:41 +00:00
Jerin Philip	3883dd1971	cache: threadsafety-fixes; optional stats collection (#245 ) * Make stats hits misses atomic to guard when mutex has multiple buckets * Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable * -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional * Make stats() call without enabling build fatal abort	2022-01-02 12:33:30 +00:00
Jerin Philip	ddccc77570	Turn logging off by default, allow turning on via config/cmdline (#295 ) * Turn logging off by default, allow turning on via config/cmdline * No need to store config in member variable if things are decided at construction time	2022-01-02 00:17:12 +00:00
Jerin Philip	d209e4fc49	Fix typo in BRT args on CI runs (#294 )	2021-12-30 16:12:30 +00:00
Jerin Philip	8eb238ed5e	HTML basic integration tests (#291 )	2021-12-30 14:29:12 +00:00
Jerin Philip	6e6042c98f	GitHub CI: Update YAML to run all tests on marian-full (#292 ) Previously there were #native tags and #wasm tags separating the two. There is now a clear separation between async, blocking and wasm.	2021-12-29 11:02:56 +00:00
Abhishek Aggarwal	9e1c1e8dbf	CI: Circle CI config script update (#287 ) - Robust artifact presence check - Variable name refactoring - Storing only those artifacts that are required - Remove commit sha from the names of the Github Releases - Use BERGAMOT_VERSION file contents for Git Tag names	2021-12-21 23:58:13 +01:00
Jelmer	f55377b687	HTML transfer empty elements (#283 ) * Fix test case This should now be implemented * Remove FilterEmpty This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack. * Insert skipped empty source spans into target HTML Also refactor variable names to better match their contents and be more consistent with each other. This implementation passes all test cases, finally! * Fix remaining style changes * Move HTML formatting to its own section That code had become exact copies in three different places	2021-12-21 13:44:04 +00:00
Jerin Philip	bcbbfe1295	Better command-line with isolation for both Services and co-located defaults and parsing (#252 ) * CLI Rework * Consolidate common tests, template specialize CLI * Remove remnant cache stuff * [BRT]: Run BRT with new cli * Formalizing bridge * Removing stuff from parsing and moving to TestSuite * Template includes, everything consolidating at tests * Inlining readFromStdin * Removing unnecessary headers * Checking in template implementation which was missing * Sane defaults, some catches at BRT * BRT: Install fixes * Updating marian-dev to point to main * Removing the enum indirection, using strings at one place, directly * Fix typo; * [BRT] test blocking service via native * Conservative defaults for workers and cache-mutex buckets in AsyncService * Create proper barriers for cmdline app * Build failure fixes * Moving common, common-impl to a familiar structure * Binary reorganization: async, blocking, wasm - async tests AsyncService - blocking tests BlockingService - wasm arranges tests for things that are Mozilla requirements. eg: - bytearray - multiple sentences in same translate request workflow. * [brt] updates to adapt to cli rework * [brt] updates to adapt to cli rework, all working * Empty commit, sync brt online and run GitHub CI * Switch for parser to have multiple mode or not * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Removing remnant faithful translation test from blocking/	2021-12-21 09:22:37 +00:00
Jelmer	1a27a8e0a7	Increase HTML test coverage (#279 ) * Fix bug in HasAlignments check When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it. * Add more unit tests for xh_scanner Trying to increase that code coverage to 100% * Add test for whitespaces around attributes * Make accessing value(), attr_name() and tag_name() at the wrong time safer * Fix bug in <style> and <script> parsing The end tag was never found * Fix parsing of mix of valueless and quoteless attributes * Sync list of void tags with Firefox' implementation of outerHTML and innerHTML Also let's use their name for it: IsVoidTag instead of IsEmptyElement. Empty was a bit ambiguous. * Bring back support for processing instructions in xh_scanner I noticed in https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 that these can be produced by innerHTML under some circumstances. * More permanent link * Use CamelCase for the internal functions I added * Rename _PI to _PROCESSING_INSTRUCTION Your IDE will do the typing for you anyway * Match symbol naming of the rest of code base CapitalCase for classes, camelCase for functions, snake_case for variables still. * Missed one 😴 * Change xhscanner's variable case also to camelCase * Partially fix case variables in html.cpp	2021-12-20 15:24:30 +00:00
Andre Natal	793d132b7c	Adding circle ci job to push the wasm artifacts to github releases (#280 ) * Adding circle ci job to push the wasm artifacts to github releases. * Updated config.yml	2021-12-18 00:05:11 +01:00
Abhishek Aggarwal	8884b39055	Disabled importing optimized gemm module (#282 ) - Until the optimized gemm module stops requiring Shared Array Buffer, we can't really use it in Firefox	2021-12-17 17:39:43 +01:00
Jelmer	420f12b3ff	Remove value length limit from HTML parser & interpolated alignments (#274 ) * Remove InterpolateAlignment And some code improvements * Replace the fixed value buffer with a std::string backing * Fix tests that had no alignment info These depended on the linear interpolation that I removed * Remove arbitrary limits on tag and attribute names This might also fix a bug caused by the eager lower casing of tag names, which could break <![CDATA , <style> and <script> * Remove equals() in favour of operator==() I trust the compiler can come up with better optimisations than I can. * Expose std::strings instead of their data Should save us some std::strlen() calls * Add & remove headers and no-longer-defined functions from header files * Remove all string buffers from xh_scanner It now directly refers to either the input stream or constant strings * Replace custom string_view with even lighter struct that's only used internally To the outside world we just expose std::string_view * Remove __builtin_sub_overflow for MSVC * ABORT if trying to restore HTML when no alignment info is available * Add test cases specifically for xh_scanner Both good for testing regression, and as a little example/reference for what behaviour to expect from it. * Add --html option to bergamot for tests This should make it easier to have some integration tests for HTML input * Add test and fix for empty inputs failing due to alignment check Co-authored-by: Jerin Philip <jerinphilip@live.in>	2021-12-15 22:01:49 +00:00
Nikolay Bogoychev	8563f0856f	Proper arch setting on win32 (#275 ) * Proper arch detection on win32 * Whoops	2021-12-14 23:53:53 +00:00
Abhishek Aggarwal	feb9c90429	Additional logs in JS translation worker (#277 ) - Print source text received in the response - Print no. of block elements in the input	2021-12-14 21:52:00 +01:00
Jerin Philip	571d312930	Constrain mistune to fix docs CI (#278 )	2021-12-14 16:34:30 +00:00
Abhishek Aggarwal	e75a9e1da3	More robust logic to import wasm gemm (#276 ) - Import optimized gemm implementation only if all the necessary functions are provided by it, otherwise use the fallback gemm	2021-12-14 16:39:19 +01:00
Abhishek Aggarwal	8e79897f30	Updated configuration for html text translation to work in wasm test page (#269 ) * Updated translator configuration in wasm test page - Added alignment: soft * Set ResponseOptions::alignment to "true" - Had to be set for html text translation to work	2021-12-01 11:32:51 +01:00
Abhishek Aggarwal	e8fd01e9f4	Updated marian-dev submodule	2021-11-30 17:19:42 +01:00
Jelmer	eea5554b91	HTML handling improvements (#266 ) * Fix out-of-bounds error when determining alignment for whole word If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds. * Don't segfault if alignment info is not available When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token. * Partial fix for handling empty elements This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side. * Fix formatting	2021-11-29 08:41:24 +00:00
Kenneth Heafield	40366162d8	HTML input (#253 ) Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl> Co-authored-by: Abhishek Aggarwal <aaggarwal@mozilla.com>	2021-11-25 13:57:50 +00:00
Abhishek Aggarwal	2b1b0531ff	Import optimized gemm implementation (when available) for wasm target (#265 ) * Enable importing optimized gemm module for wasm - Updated emscripten generated JS code to -- import and use the optimized gemm module when available, otherwise use fallback gemm implementation * Added logging for gemm implementation being used for wasm target	2021-11-17 09:18:55 +01:00
Abhishek Aggarwal	f9e55b3cd8	Make script run from any directory (#262 ) * Make script run from any directory	2021-11-15 22:30:52 +01:00
Andre Natal	d6a14b1d6f	Fix badge to point to this repo instead mozilla's (#261 )	2021-11-15 08:14:21 +00:00
Jerin Philip	5a693b7eec	Fixes windows workflow for PCRE2 (#260 )	2021-11-05 20:48:28 +00:00
Jerin Philip	fa4efb483b	Update ssplit cpp, pcre2 source compile to fix broken builds (#258 ) * Update ssplit cpp, pcre2 source compile to fix tests * Syncing with browsermt/ssplit-cpp * Removing accidental binary inclusion * Removing brt accidental update by git add -u * Fix windows workflow, vcpkg is broken use our cmake route * [ssplit-cpp] Try searching different library names for Windows	2021-11-05 16:46:03 +00:00
Abhishek Aggarwal	7693a1d007	Updated marian submodule (#256 )	2021-11-03 13:54:48 +01:00
Jerin Philip	0bb8095bca	Deprecate hardAlignment in favour of softAlignment (#250 )	2021-11-01 19:21:28 +00:00
Jerin Philip	806169c822	Recover logging (#226 )	2021-11-01 16:31:01 +00:00
Abhishek Aggarwal	c5bc3f5191	Update config "skip-cost" to enable log probabilities for QE scores (#247 ) - Updated wasm test page	2021-11-01 13:06:23 +01:00
Jerin Philip	9b443997e2	EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243 )	2021-10-31 12:33:42 +00:00
Jerin Philip	47e57c95a6	[ssplit-cpp] Enable position independent library when compiled from sources (#240 )	2021-10-29 13:40:28 +01:00
Jerin Philip	45412ce7de	Set PR to any branch to trigger workflows (#230 )	2021-10-28 09:30:02 +01:00
Jerin Philip	2b98c67996	Cache for translations (#227 ) Sets a cache to operate for each sentence that a TranslationModel processes, caching the corresponding marian::History for a {TranslationModel::Id, marian::Words} key. Cache is thus shared across multiple TranslationModels bound to the lifetime of a Service. Cache gracefully downgrades in the case of WebAssembly.	2021-10-27 20:37:05 +01:00
Abhishek Aggarwal	d0d08c0f54	JS bindings for Quality Estimation (#239 ) * Quality Score bindings complete * Updated wasm test page to test the bindings - Word and sentence scores can be seen in browser console	2021-10-27 19:26:55 +02:00
Abhishek Aggarwal	c5167b3d8c	Import matrix-multiply from a separate wasm module (#232 ) * Updated marian-dev submodule * Import wasm gemm from a separate wasm module - The fallback implementation of gemm is currently being imported dynamically for wasm target * Updated CI scripts and README to import GEMM from a separate wasm module * Setting model config to int8shiftAlphaAll in wasm test page	2021-10-27 11:54:39 +02:00
Abhishek Aggarwal	a0cb1e4b3d	Wasm test page UI for translating b/w non-English language pairs (#231 ) * Updated Wasm test page UI for translating b/w non-English language pairs * Both "from" and "to" language dropdowns now allow non-English languages	2021-10-19 14:40:54 +02:00
Abhishek Aggarwal	c7b626dfd0	Adapted wasm test page for new Service interface (#224 ) - The new interface now supports running multiple TranslationModels	2021-09-28 15:53:02 +05:30
Jerin Philip	cf541c68f9	Multiple TranslationModels Implementation (#210 ) For outbound translation, we require having multiple models in the inventory at the same time and abstracting the "how-to-translate" using a model out. Reorganization: TranslationModel + Service. The new entity which contains everything required to translate in one direction is `TranslationModel`. The how-to-translate blocking single-threaded mode of operation or async multi-threaded mode of operation is decoupled as `BlockingService` and `AsyncService`. There is a new regression-test using multiple models in conjunction added, also serving as a demonstration for using multiple models in Outbound Translation. WASM: WebAssembly, due to the inability to use threads, uses `BlockingService`. Bindings are provided with a new API to work with a Service, and multiple TranslationModels which the client (JS extension) can inventory and maintain. Ownership of a given `TranslationModel` is shared while translations using the model are active in the internal mechanism. Config-Parsing: So far bergamot-translator has been hijacking marian's config-parsing mechanisms. However, in order to support multiple models, it has become impractical to continue this approach and a new config-parsing that is bergamot specific is provisioned for command-line applications constituting tests. The original marian config-parsing tooling is only associated with a subset of `TranslationModel` now. The new config-parsing for the library manages workers and other common options (tentatively). There is a known issue of: Inefficient placing of workspaces, leading to more memory usage than what's necessary. This is to be fixed trickling down from marian-dev in a later pull request. This PR also brings in BRT changes which fix speed-tests that were broken and also fixes some QE outputs which were different due to not using shortlist.	2021-09-21 18:10:40 +01:00
Andre Barbosa	63120c174e	QualityEstimation: Preliminary Implementation (#197 ) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com> Co-authored-by: Jerin Philip <jerinphilip@live.in>	2021-09-16 16:28:40 +01:00
Jerin Philip	48e955c468	BRT: Update sacrebleu to get tests back working (#217 ) Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>	2021-09-07 19:10:41 +01:00
Abhishek Aggarwal	8e4374282a	Circle CI wasm artifacts for non-wormhole builds	2021-08-31 17:01:52 +02:00
Abhishek Aggarwal	cafb65e0b5	Wasm builds without SharedArrayBuffer	2021-08-27 09:07:06 +02:00
Abhishek Aggarwal	ff391c6f00	Updated marian submodule to latest commit of master	2021-08-27 09:07:06 +02:00
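
Illustration for acbc46d816 (self-closing void tags): a minimal sketch of the idea described in that commit message, not the code from the repository. Only the token name `TT_TAG_END` and the notion of a void tag come from the message; `TT_TAG_START`, `TT_TEXT`, the `Token` struct, the shortened void-tag list and the `balance` function are hypothetical names chosen for this example. It assumes the parser emits a start token and an end token for `<br/>`, and shows why ignoring the end token for void tags keeps the stack of open elements consistent.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical token kinds; only TT_TAG_END is named in the commit message.
enum TokenType { TT_TAG_START, TT_TAG_END, TT_TEXT };

struct Token {
  TokenType type;
  std::string name;  // tag name; empty for text tokens
};

// Abbreviated, illustrative list of void elements (not the project's list).
bool isVoidTag(const std::string &name) {
  static const std::unordered_set<std::string> kVoidTags = {"br", "hr", "img", "input", "meta"};
  return kVoidTags.count(name) > 0;
}

// Walks the token stream and tracks open elements on a stack.
// Returns true if every close tag matches the innermost open element
// and nothing is left open at the end.
bool balance(const std::vector<Token> &tokens) {
  std::vector<std::string> open;
  for (const Token &token : tokens) {
    switch (token.type) {
      case TT_TAG_START:
        // Void tags are treated as immediately closed: never pushed.
        if (!isVoidTag(token.name)) open.push_back(token.name);
        break;
      case TT_TAG_END:
        // `<br/>` is reported as `<br></br>`; the end token for a void tag
        // is ignored because the element was never pushed.
        if (isVoidTag(token.name)) break;
        if (open.empty() || open.back() != token.name) return false;
        open.pop_back();
        break;
      case TT_TEXT:
        break;
    }
  }
  return open.empty();
}
```

Without the `isVoidTag` check in the `TT_TAG_END` branch, the end token produced for `<br/>` would try to pop an element that was never pushed, which is the parse error the commit describes.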

408 Commits
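
Illustration for 6a4f409cda (pivot translation): that commit's test asserts that the alignment produced after pivoting is a probability distribution over source tokens given a target token. The commit message does not say how the two alignments are combined, so the sketch below assumes one plausible approach: multiplying the two soft-alignment matrices, which preserves that property when each row of both inputs already sums to 1. `Matrix`, `compose` and `rowsAreDistributions` are hypothetical names, not part of the bergamot-translator API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// pivotGivenTarget[t][p]  = P(pivot token p  | target token t)
// sourceGivenPivot[p][s]  = P(source token s | pivot token p)
// Precondition: every row of pivotGivenTarget has sourceGivenPivot.size()
// entries, and all rows of sourceGivenPivot have the same length.
// Returns result[t][s] = sum over p of P(p | t) * P(s | p).
Matrix compose(const Matrix &pivotGivenTarget, const Matrix &sourceGivenPivot) {
  const std::size_t numSource = sourceGivenPivot.empty() ? 0 : sourceGivenPivot[0].size();
  Matrix result(pivotGivenTarget.size(), std::vector<float>(numSource, 0.0f));
  for (std::size_t t = 0; t < pivotGivenTarget.size(); ++t) {
    for (std::size_t p = 0; p < pivotGivenTarget[t].size(); ++p) {
      for (std::size_t s = 0; s < numSource; ++s) {
        result[t][s] += pivotGivenTarget[t][p] * sourceGivenPivot[p][s];
      }
    }
  }
  return result;
}

// Checks the property the test asserts: every row is non-negative
// and sums to 1 within a tolerance.
bool rowsAreDistributions(const Matrix &matrix, float tolerance = 1e-5f) {
  for (const std::vector<float> &row : matrix) {
    float sum = 0.0f;
    for (float value : row) {
      if (value < 0.0f) return false;
      sum += value;
    }
    if (std::fabs(sum - 1.0f) > tolerance) return false;
  }
  return true;
}
```

The product of two row-stochastic matrices is itself row-stochastic, so a check like `rowsAreDistributions` on the composed matrix is a cheap regression guard of the kind the commit describes.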