This does a lazy init of the two CUDA handles that we use on the GPU. When initialized eagerly, cusparse alone consumes about 250MB of CPU RAM and about 75MB of GPU RAM, so the handles should only be created when actually needed.
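A minimal sketch of the idea, assuming illustrative class and member names rather than the actual Marian backend code:

```cpp
#include <cublas_v2.h>
#include <cusparse.h>

// Illustrative sketch only: handles are created on first use instead of at
// backend construction, so the ~250MB host / ~75MB device overhead of
// cusparse is only paid when a sparse operation is actually requested.
class GpuHandles {
private:
  cublasHandle_t cublasHandle_{nullptr};
  cusparseHandle_t cusparseHandle_{nullptr};

public:
  cublasHandle_t getCublasHandle() {
    if(!cublasHandle_)              // lazy init on first access
      cublasCreate(&cublasHandle_);
    return cublasHandle_;
  }

  cusparseHandle_t getCusparseHandle() {
    if(!cusparseHandle_)            // lazy init on first access
      cusparseCreate(&cusparseHandle_);
    return cusparseHandle_;
  }

  ~GpuHandles() {
    if(cublasHandle_)
      cublasDestroy(cublasHandle_);
    if(cusparseHandle_)
      cusparseDestroy(cusparseHandle_);
  }
};
```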
This is the first PR in a series to get my graph-group* clean-up into master. The end goal is to have clean label-based updates and fp16 in master.
This small change guards against circular references in the scheduler. For large models the freeing order is wrong, which can prevent GPU memory from being freed at the correct time, leaving no memory for the actual model. The weak references solve that.
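A minimal sketch of the pattern, with hypothetical class and member names (the actual members live in the scheduler / graph-group code):

```cpp
#include <memory>

class Scheduler;

class GraphGroup {
  std::shared_ptr<Scheduler> scheduler_;   // owns the scheduler
public:
  void setScheduler(std::shared_ptr<Scheduler> s) { scheduler_ = std::move(s); }
};

class Scheduler {
  // A shared_ptr back to the owner would create a reference cycle and keep
  // both objects (and the GPU memory they hold) alive longer than intended.
  std::weak_ptr<GraphGroup> owner_;
public:
  void setOwner(std::weak_ptr<GraphGroup> o) { owner_ = std::move(o); }

  void notify() {
    if(auto owner = owner_.lock()) {   // only touch the owner if it still exists
      // ... use owner ...
    }
  }
};
```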
* use lowest() as INVALID_PATH_SCORE instead of -9999, which caused problems with very long sequences (a small sketch follows below)
* add a number of aborts related to invalid path scores during beam search
* Compile Marian on macOS with Clang. Two linker errors left
* macOS has a different definition for unsigned long
* Find OpenBLAS on macOS
* Fix a typo in the BLAS detection
* Simplify and add comments
* Refactor CPU allocation code. Do not fall back to malloc
* Fix compilation warning on gcc
* Refactor memory allocation
* Make things compile with clang-8 with fewer warnings.
* Eliminate clang warnings when compiling examples and when compiling without MKL
* Added USE_MKL option to compile without MKL for debugging even when MKL is installed
* Fixed issues with compiling examples with clang
* Fix compile errors with clang in src/tests.
* Fix missing whitespace in error message in src/tests/sqlite.cpp.
* Responding to Frank Seide's code review.
* Eliminate clang warnings when compiling with -DUSE_FBGEMM=on.
* Fix compilation on gcc 8
* Get Marian to compile with Clang-10.
* Fix Clang-8 warnings when compiling with marian-server
* Add more comments and explicit unsigned long long for Windows
* Pull in fbgemm that supports macOS
* Fix warning flags order in CMakeLists.txt
Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>
Co-authored-by: Ulrich Germann <ulrich.germann@gmail.com>
Co-authored-by: Roman Grundkiewicz <romang@amu.edu.pl>
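A minimal illustration of the INVALID_PATH_SCORE change from the first item in the list above; a plain assert stands in for the actual aborts, and the names are illustrative:

```cpp
#include <cassert>
#include <limits>

// -9999 is not a safe sentinel: the accumulated log-probability of a very
// long hypothesis can legitimately drop below it. Using the lowest
// representable float keeps the sentinel strictly below any real path score.
const float INVALID_PATH_SCORE = std::numeric_limits<float>::lowest();

inline float checkedPathScore(float pathScore) {
  // One of the aborts mentioned above, sketched as an assert: a real path
  // score must never collide with the invalid marker.
  assert(pathScore > INVALID_PATH_SCORE && "invalid path score during beam search");
  return pathScore;
}
```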
Fixes empty line handling with the factored segmenter. In my previous PR, where I fixed general empty line handling, I misunderstood the relation between WordIndex and factors and did an incorrect inverse look-up of the word index of EOS. This should be fixed now for FS; there should be no change when not using FS.
The previous mechanism to remove empty inputs does not play well with batch purging (removal of finished sentences). Now we reuse the batch purging mechanism to get rid of empty inputs: for a batch entry whose source is empty, we force EOS for all its beam entries, and the purging then takes care of the rest. We set the probability of the forced EOS to log(1) = 0.
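A rough sketch of the idea with hypothetical, simplified types; the real logic lives in the beam search and batch purging code:

```cpp
#include <cmath>
#include <vector>

// Hypothetical, simplified beam entry.
struct Hypothesis {
  unsigned word;      // last emitted word id
  float pathScore;    // accumulated log-probability
};

// For a source batch entry that is empty, force every beam entry to EOS with
// probability log(1) = 0, so the existing purging of finished sentences
// removes it like any other completed hypothesis.
void forceEosForEmptyInput(std::vector<Hypothesis>& beam, unsigned eosId) {
  for(auto& hyp : beam) {
    hyp.word = eosId;
    hyp.pathScore += std::log(1.f);   // i.e. add 0: the forced EOS is "free"
  }
}
```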
* Clears the cache for the RNN object in the transformer, otherwise a stale tensor might be kept around.
* Add missing `hash()` and `equal` functions everywhere.
* Fixes bug from deployment test.
Splits up the header file into a header and a *.cu file. This comes at the price of having to include explicit specializations for the combinations of types, as done in element.inc and add.inc. No code changes otherwise.
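A rough sketch of the resulting layout; file, kernel, and function names here are illustrative, not the actual Marian signatures:

```cpp
// element.cu (illustrative) -- the template definition moves here, out of the header.
#include <cuda_runtime.h>

template <typename ElementType>
__global__ void gCopy(ElementType* out, const ElementType* in, int length) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < length)
    out[i] = in[i];
}

template <typename ElementType>
void Element(ElementType* out, const ElementType* in, int length) {
  int threads = 256;
  int blocks  = (length + threads - 1) / threads;
  gCopy<<<blocks, threads>>>(out, in, length);
}

// In the real code an element.inc is included at this point; it enumerates the
// explicit instantiations for every type combination the rest of the code
// links against, since the definition is no longer visible from the header:
template void Element<float>(float*, const float*, int);
// template void Element<half>(half*, const half*, int);  // etc., one line per supported type
```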
Add CMake options to disable specific compute capabilities.
When run with `make -j16` this compiles in about 6 minutes instead of 7 minutes. Selecting only SM70 during compilation brings down the time to 3 minutes.
* Downgrade NCCL to 2.3.7 as 2.4.2 is buggy (hangs with larger models)
* Actually enable gradient-checkpointing; the previous option was inactive
* Clean-up training-only options that should not be displayed for decoder and scorer
* Re-enable conversion to FP16 if element types are compatible (belong to the same type class)
* Fix a few typos and add more verbose log messages.
* Add printing of word-level scores
* Add option --no-spm-decode
* Fix precision for word-level scores
* Fix getting the no-spm-decode option
* Update CHANGELOG
* Add comments and refactor
* Print word-level scores next to other scores in an n-best list
* Remove --word-scores from marian-scorer
* Add --no-spm-decode only if compiled with SentencePiece
* Add comments
* Print word scores before model scores in n-best lists
* Update VERSION
Co-authored-by: Marcin Junczys-Dowmunt <Marcin.JunczysDowmunt@microsoft.com>
For the FBGEMM-based int8 implementation, the packed matrix (model) can differ based on the available AVX instruction sets. This PR splits the packed8 format into two separate data formats (packed8avx2, packed8avx512). This enables any packed model to be generated on any machine.
* Added packed8avx2 and packed8avx512 types, removed the packed8 type
* Added blocking factors to the fbgemm interface based on the pack type for the pack and gemm functions.
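A rough sketch of how the two formats could be told apart at pack time; the enum values mirror the type names above, and `cpuHasAvx512()` is a hypothetical stand-in for the actual CPU-feature check:

```cpp
#include <string>

// Mirrors the two on-disk formats described above.
enum class PackedFormat { packed8avx2, packed8avx512 };

// Hypothetical stand-in; the real code would query the CPU's instruction sets.
static bool cpuHasAvx512() { return false; }

// Because each format carries the blocking factors of its target ISA, a model
// packed for AVX-512 can be produced on an AVX2-only machine and vice versa:
// packing only needs to be told which format to emit.
PackedFormat choosePackFormat(const std::string& requested) {
  if(requested == "packed8avx512")
    return PackedFormat::packed8avx512;
  if(requested == "packed8avx2")
    return PackedFormat::packed8avx2;
  // no explicit request: default to the best format the current CPU supports
  return cpuHasAvx512() ? PackedFormat::packed8avx512 : PackedFormat::packed8avx2;
}
```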
This adds two new features related to factored vocabs:
* a new conditioning mechanism that mimics a mini transformer layer between the emitted lemma and the factors. This affects only factored vocabs and must be enabled explicitly. Change is in `generic.cpp`.
* in case of inline phrase-fixing, cross-attention is no longer allowed to look into the source sequence. This only affects inputs with `|is` factors or `<IOPEN>` tags. Change is in `states.h`.
* Adam optimizer now skips the update if the gradient contains a NaN (see the sketch after this list). Does not affect existing configs unless they produce NaNs. Change is in `optimizers.cpp`.
* reverts to the old `LayerNorm` routine. *TODO*: Is this change still needed?
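A minimal sketch of the NaN guard mentioned above, with illustrative names; the Adam update itself is omitted:

```cpp
#include <cmath>
#include <vector>

// Return true if any gradient element is NaN.
bool hasNaN(const std::vector<float>& gradient) {
  for(float g : gradient)
    if(std::isnan(g))
      return true;
  return false;
}

void adamStep(std::vector<float>& params, const std::vector<float>& gradient) {
  if(hasNaN(gradient))
    return;        // skip this update entirely; parameters and moment estimates stay untouched
  // ... regular Adam update of params and the first/second moment estimates ...
  (void)params;    // silences the unused-parameter warning in this sketch
}
```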
Additional changes:
* new method `locate()` for accessing batch data with array coordinates
* new overloads for `constant_like()` from a vector directly (most common use case)
* rvalue-ref version of `fromVector()`
This implements Sequential Unlikelihood Training from https://arxiv.org/abs/1908.04319
* implementation as an expensive multi-op; a special node is in progress.
* fixed gather operator to work in batched cases
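For reference, the token-level unlikelihood term from the cited paper adds, on top of the usual likelihood loss, a penalty for putting probability mass on a set of negative candidates C_t (typically tokens already seen in the context), weighted by a hyperparameter alpha:

```latex
\mathcal{L}_t \;=\; -\log p_\theta(x_t \mid x_{<t})
  \;-\; \alpha \sum_{c \in C_t} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr)
```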