Use std::string for config. The marian translator can now be launched
through the API. A string config is converted to marian::Options through
a stopgap workaround, which carries an explanatory note.
Enables macOS and Ubuntu CPU-only builds through GitHub CI. The CI
scripts are copied from marian-dev with the necessary changes.
3rd-party/marian-dev is modified to meet C++17 requirements; the
modifications are for half_float.
Only the bergamot-translator library should be linked to the main
target. Any other library (marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit
pcrecpp.a pcre.a) should be linked to the bergamot-translator target
inside the src/translator folder.
- Truncating long sentences into those of a specified length for faster
processing is now a separate function, for improved readability.
- Replaces push_back with emplace_back in places to avoid copies.
- query_to_segments is renamed as process.
- Comments are added to make the code easier to follow.
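
A minimal sketch of the first two items, assuming a sentence is held as a
vector of string tokens. The name wrapTokens, the Segment alias and the
signature are hypothetical and are not taken from the project:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative alias: one segment is a run of tokens.
using Segment = std::vector<std::string>;

// Splits `tokens` into consecutive segments of at most `maxLength` tokens.
// Hypothetical helper; the project's own function may differ.
std::vector<Segment> wrapTokens(const std::vector<std::string>& tokens,
                                std::size_t maxLength) {
  std::vector<Segment> segments;
  if (maxLength == 0) return segments;  // no sensible segment length
  for (std::size_t begin = 0; begin < tokens.size(); begin += maxLength) {
    std::size_t end = std::min(begin + maxLength, tokens.size());
    // emplace_back builds the Segment in place from the iterator pair,
    // so no temporary Segment is created and then copied or moved.
    segments.emplace_back(tokens.begin() + begin, tokens.begin() + end);
  }
  return segments;
}
```

Calling wrapTokens on five tokens with maxLength 2 yields three segments,
the last one holding the single remaining token.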
Vocabularies were previously loaded in each thread and copied several
times. They are now loaded only once in Service and passed by reference
from there on.
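
The ownership pattern can be sketched roughly as follows. Vocab, Worker
and the Service layout here are simplified stand-ins written for
illustration, not the project's actual classes:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a vocabulary; the real type comes from marian.
struct Vocab {
  std::vector<std::string> entries;
};

using Vocabs = std::vector<std::shared_ptr<const Vocab>>;

// A worker only refers to the vocabularies owned by Service.
class Worker {
 public:
  explicit Worker(const Vocabs& vocabs) : vocabs_(vocabs) {}
  const Vocab& sourceVocab() const { return *vocabs_.front(); }

 private:
  const Vocabs& vocabs_;  // no copy; Service must outlive the worker
};

class Service {
 public:
  Service(Vocabs vocabs, std::size_t numWorkers) : vocabs_(std::move(vocabs)) {
    workers_.reserve(numWorkers);
    for (std::size_t i = 0; i < numWorkers; ++i) {
      workers_.emplace_back(vocabs_);  // every worker sees the same storage
    }
  }
  // Copying would leave workers referring to another Service's vocabs.
  Service(const Service&) = delete;
  Service& operator=(const Service&) = delete;

  const Worker& worker(std::size_t i) const { return workers_[i]; }

 private:
  Vocabs vocabs_;  // declared first, so it is initialized before workers_
  std::vector<Worker> workers_;
};
```

Holding a plain reference keeps the workers cheap, at the cost of tying
their lifetime to Service; that is why copying is disabled in this sketch.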
This change makes Tokenizer rather moot as a class, since it has only
one private member and one function. It is moved into TextProcessor.
SentenceSplitter, however, remains a separate class.
utils.{h,cpp} had only a single loadVocabularies function, which is at
the moment required only in Service. loadVocabularies is now a function
inside Service, and utils.* is removed.
A faster linesplitter added for benchmarks is removed in favour of @ug's
ssplit-cpp.
NOTE: ssplit-cpp's regex-based implementation is slow for one-line
parses. Ideally, upstream ssplit-cpp should be improved so that this
case trivially reduces to a faster newline-character-based split.
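
For reference, a newline-character-based split can be sketched as below.
This is not ssplit-cpp code and not the removed linesplitter; splitLines
is a hypothetical name, and skipping empty lines is an assumption:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `text` on '\n' and returns views into the original buffer.
// Empty lines are skipped. The caller must keep `text` alive while the
// returned views are in use.
std::vector<std::string_view> splitLines(std::string_view text) {
  std::vector<std::string_view> lines;
  std::size_t begin = 0;
  while (begin < text.size()) {
    std::size_t end = text.find('\n', begin);
    if (end == std::string_view::npos) end = text.size();
    if (end > begin) lines.emplace_back(text.substr(begin, end - begin));
    begin = end + 1;  // step past the newline
  }
  return lines;
}
```

A single linear scan with no regex involved, which is the reduction the
note asks for.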
CMakeLists have been modified with the necessary includes to add the
browsermt/mts@nuke files to the bergamot-translator library. In
addition, the ssplit dependency and its corresponding includes are
added.
Compilation with Intel MKL fails because the libraries cannot be found.
To solve this, 3rd_party/CMakeLists.txt is modified with @ug's fixes to
propagate variables (EXT_LIBS, etc.) at a library level.
Modifications to SentencePiece are necessary to provide token-level
string_views. This commit changes marian to an alternate branch which
has the feature incorporated.
- Marian uses Options class everywhere as configuration options
- Owing to this project's heavy dependency on Marian:
-- Made the internal implementation files of the project work
with marian::Options instead of TranslationModelConfiguration
-- An Adaptor class to adapt TranslationModelConfiguration
to marian::Options will be added in a following commit
- This class is an implementation of the AbstractTranslationModel
interface
- This is the main class that will implement the translate API
- Contains dummy responses for now
- Contains classes for the API specification (doc/Unified_API.md)
- Things to be changed/decided later:
-- Use of std::string_view to represent ranges
-- Adding Alignment information
-- Basic Setters and Getters for some of the classes
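
As a rough illustration of the std::string_view item, a view can stand
for a range in the source text and be mapped back to byte offsets when
needed. ByteRange and toByteRange are hypothetical names and are not
part of the API specification:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Plain offsets into a text buffer: [begin, end).
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Maps `view` back to offsets within `text`.
// Precondition: `view` points into the buffer that `text` refers to.
ByteRange toByteRange(std::string_view text, std::string_view view) {
  std::size_t begin = static_cast<std::size_t>(view.data() - text.data());
  return {begin, begin + view.size()};
}
```

A string_view carries no ownership, so any range expressed this way is
only valid while the underlying text is alive and unmodified.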