Commit Graph

1061 Commits

guillaume-be
84561ec82b
Tokenizer special token map update (#330)
* Updates for compatibility with tokenizers special token rework

* Updated mask pipeline methods

* Bumped version

* Fix clippy warnings
2023-01-30 17:53:18 +00:00
Åke Amcoff
80e0197e2c
Generalize input of NERModel::predict_full_entities (#329) 2023-01-23 20:24:18 +00:00
dependabot[bot]
e1e8fc615d
Bump torch from 1.13.0 to 1.13.1 (#328)
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
- [Commits](https://github.com/pytorch/pytorch/compare/v1.13.0...v1.13.1)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-22 09:25:33 +00:00
guillaume-be
f1b8409ca8
0.20.0 release (#327)
* Relax dependencies, fix doctests

* Bump tokio for future rust compat, fix tests

* Updated changelog
2023-01-22 09:00:14 +00:00
guillaume-be
0fc5ce6ad4
CodeBERT Pretrained models and examples (#322)
* Addition of Codebert examples

* Addition of CodeBERT pretrained models, CodeBERT example
2023-01-20 19:02:33 +00:00
guillaume-be
f12e8ef475
Aligned ModelForTokenClassification and ModelForSequenceClassification APIs (#323) 2023-01-15 11:10:38 +00:00
Andreas Haufler
445b76fe7b
Permits TokenClassificationOption / DistilBertForTokenClassification to fail gracefully for an invalid configuration. (#320)
* Properly handle config errors when creating classification models

Instead of panicking, we now return a proper RustBertError so that
an invalid model or config does not crash the whole system.

* Properly formats newly added code.

* Fixes an example within the documentation

* Properly unwraps the newly created results in unit tests.

* Fixes some code formatting issues.

* Uses proper/idiomatic error handling in unit tests.

* Moves the "?" to the correct position.
2023-01-15 10:32:57 +00:00
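
A minimal sketch of the error-handling pattern described in #320 above, using simplified stand-in types rather than the crate's actual model structs: construction returns a `RustBertError` for an invalid configuration instead of panicking.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; the crate's real types differ.
#[derive(Debug)]
pub enum RustBertError {
    InvalidConfigurationError(String),
}

pub struct ModelConfig {
    pub id2label: Option<HashMap<i64, String>>,
}

pub struct TokenClassificationModel {
    labels: HashMap<i64, String>,
}

impl TokenClassificationModel {
    /// Returns an error instead of panicking when the config lacks a label map.
    pub fn new(config: ModelConfig) -> Result<Self, RustBertError> {
        let labels = config.id2label.ok_or_else(|| {
            RustBertError::InvalidConfigurationError(
                "id2label must be provided for token classification".to_string(),
            )
        })?;
        Ok(Self { labels })
    }
}

fn main() {
    let bad_config = ModelConfig { id2label: None };
    // The caller decides how to handle the failure; nothing panics here.
    match TokenClassificationModel::new(bad_config) {
        Ok(_) => println!("model ready"),
        Err(e) => eprintln!("invalid configuration: {e:?}"),
    }
}
```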
guillaume-be
2c4a79524d
Update README.md (#318) 2023-01-05 15:37:31 +00:00
guillaume-be
3bf86331ca
- Updated cached-path dependency (#310) 2022-12-22 08:51:43 +00:00
guillaume-be
fdf5503163
Fixed Clippy warnings (#309)
* - Fixed Clippy warnings
- Updated `tch` dependency
- Updated README to avoid confusion with respect to the required `LIBTORCH` version for the repository and published package versions

* Fixed Clippy warnings (2)

* Fixed Clippy warnings (3)
2022-12-21 17:52:26 +00:00
Vincent Xiao
dae899fea6
Add pipelines::masked_language and codebert support (#282)
* add support for loading local model in SequenceClassificationConfig

* adjust config to match the SequenceClassificationConfig

* add pipelines::masked_language

* add support and example for codebert

* provide an optional mask_token String field for the masked_language pipeline

* update example for masked_language pipeline

* revert CodeBERT support

* revert support for loading local model

* resolve conflicts

* update MaskedLanguageConfig

* fix doctest error in zero_shot_classification.rs

* MaskedLM pipeline updates

* fix multiple masked tokens, added test

* Updated changelog and docs

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-12-21 16:58:02 +00:00
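
A rough usage sketch for the masked-language pipeline added above; the availability of a `Default` implementation for `MaskedLanguageConfig` and the exact `predict` signature are assumptions here and should be checked against the crate documentation.

```rust
use rust_bert::pipelines::masked_language::{MaskedLanguageConfig, MaskedLanguageModel};

fn main() -> anyhow::Result<()> {
    // A `Default` config is assumed for brevity; the PR adds an optional
    // `mask_token` field on the config for models with custom mask tokens.
    let config = MaskedLanguageConfig::default();
    let model = MaskedLanguageModel::new(config)?;

    // Multiple masked positions per input are supported (see the fix above).
    let input = ["Paris is the [MASK] of [MASK]."];
    let output = model.predict(input)?;
    println!("{output:?}");
    Ok(())
}
```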
Anna Melnikov
a34cf9f8e4
Make predict methods in ZeroShot pipeline return Result instead of panicking on unwrap (#301)
* Add checked prediction methods

- Add checked prediction methods to ZeroShotClassificationModel.
These methods return Option and convert any underlying errors into None,
to allow callers to implement appropriate error handling logic.

* Update ZeroShot example to use checked method.

* Add tests for ZeroShot checked methods

* Change checked prediction methods to return Result

* refactor: rename *_checked into try_*

Rename *_checked methods into try_* methods.
This is more idiomatic vis-a-vis the Rust standard library.

* refactor: remove try_ prefix from predict methods

* refactor: change return from Option to Result

Change return type of ZeroShotClassificationModel.prepare_for_model
from Option to Result. This simplifies the code and returns
the error closer to its origin.

This addresses comments from @guillaume-be.

* refactor: address clippy lints in tests

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-12-04 09:10:01 +00:00
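
After #301, callers handle prediction failures themselves instead of hitting an internal unwrap. A sketch of the resulting call site; the trailing template/max_length arguments follow the crate docs, and the exact signature should be treated as an assumption.

```rust
use rust_bert::pipelines::zero_shot_classification::ZeroShotClassificationModel;

fn main() -> anyhow::Result<()> {
    let model = ZeroShotClassificationModel::new(Default::default())?;
    let input = ["Who are you voting for in 2020?"];
    let labels = ["politics", "public health", "economics", "sports"];

    // `predict` now returns a `Result`, so `?` propagates any failure
    // instead of the pipeline panicking internally.
    let output = model.predict(&input, &labels, None, 128)?;
    println!("{output:?}");
    Ok(())
}
```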
guillaume-be
a0ef06bccf
revert to windows 2019 pending 2022 fix (#304) 2022-12-03 11:31:22 +00:00
guillaume-be
a93b406334
Update CI to build without defaults (#299)
* Avoid attention copy, fix remote feature

* revert conversation change

* Updated CI for build without defaults
2022-11-20 09:39:00 +00:00
guillaume-be
05367b4df2
Make max_length optional (#296)
* Made `max_length` an optional argument for generation methods and pipelines

* Updated changelog
2022-11-15 19:20:51 +00:00
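
A small sketch of #296 from the caller's side, assuming the text-generation pipeline config: with `max_length: None`, no fixed cap is imposed on the generated length.

```rust
use rust_bert::pipelines::text_generation::{TextGenerationConfig, TextGenerationModel};

fn main() -> anyhow::Result<()> {
    // `max_length` is now an Option: `None` removes the mandatory cap.
    let config = TextGenerationConfig {
        max_length: None,
        ..Default::default()
    };
    let model = TextGenerationModel::new(config)?;
    let output = model.generate(&["The cat sat on"], None);
    println!("{output:?}");
    Ok(())
}
```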
guillaume-be
5d2b107e99
Keyword/Keyphrase extraction (#295)
* stop word tokenizer implementation

* - Addition of all-mini-lm-l6-v2

* initial implementation of keyword scorer

* Cosine Similarity keyword extraction

* Added lower case parsing from tokenizer config for sentence embeddings

* Initial draft of pipeline complete

* Addition of Maximal Marginal Relevance scorer

* Addition of Max Sum scorer

* Lowercase and ngrams handling

* Improved n-gram handling

* Skip n-grams containing stopwords

* Fixed short sentence input and added documentation

* Updated documentation and defaults, added example

* Addition of tests for keywords extractions

* Updated changelog

* Fixed Clippy warnings
2022-11-13 08:51:10 +00:00
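
A hedged usage sketch for the keyword extraction pipeline introduced in #295; the type and method names below follow the pattern of the other pipelines and should be checked against the published docs.

```rust
use rust_bert::pipelines::keywords_extraction::KeywordExtractionModel;

fn main() -> anyhow::Result<()> {
    // Defaults are assumed to use the all-mini-lm-l6-v2 sentence embeddings;
    // the scorer (cosine similarity, Maximal Marginal Relevance or Max Sum)
    // is selected on the config struct.
    let model = KeywordExtractionModel::new(Default::default())?;
    let input = "Rust is a multi-paradigm, general-purpose programming language \
                 that emphasizes performance, type safety, and concurrency.";
    let keywords = model.predict(&[input])?;
    println!("{keywords:?}");
    Ok(())
}
```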
guillaume-be
2ffb6005f3
- Addition of all-mini-lm-l6-v2 (#294) 2022-11-09 17:51:31 +00:00
guillaume-be
c6771d3992
Update to tch=0.9.0 (#293)
* Fixed short sentence input and added documentation

* Fixed Clippy warnings

* Updated CI Python version

* cleaner dim specification
2022-11-07 17:45:52 +00:00
guillaume-be
340be36ed9
Mixed resources (#291)
* - made `merges` resource optional for all pipelines
- allow mixing local and remote resources for pipelines

* Updated changelog

* Fixed Clippy warnings
2022-10-30 07:39:52 +00:00
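
A sketch of what #291 enables: pointing a pipeline at a local weights file (hypothetical path) while keeping the remaining resources remote, with `merges` now optional for model types that do not need it.

```rust
use std::path::PathBuf;

use rust_bert::gpt2::{Gpt2ConfigResources, Gpt2MergesResources, Gpt2VocabResources};
use rust_bert::resources::{LocalResource, RemoteResource};

fn main() {
    // Hypothetical local weights mixed with remote auxiliary resources.
    let model_resource = LocalResource {
        local_path: PathBuf::from("path/to/rust_model.ot"),
    };
    let config_resource = RemoteResource::from_pretrained(Gpt2ConfigResources::GPT2);
    let vocab_resource = RemoteResource::from_pretrained(Gpt2VocabResources::GPT2);
    // `merges` is now optional; pass `None` for models that do not use it.
    let merges_resource = Some(RemoteResource::from_pretrained(Gpt2MergesResources::GPT2));

    // Any pipeline configuration can accept this mix of providers.
    let _ = (model_resource, config_resource, vocab_resource, merges_resource);
}
```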
guillaume-be
78da0f4814
Added example for 2-turn conversation in docs (#287) 2022-10-08 09:31:20 +01:00
guillaume-be
c9f7e8653d
- Fixed 1.64 Clippy warnings (#284) 2022-10-08 08:38:10 +01:00
guillaume-be
59c0e668d6
- Fixed RoBERTa confic checks for sentence classification (#281) 2022-09-10 08:50:25 +01:00
guillaume-be
e4a2a102e0
Rust 1.63 update (#280)
* Fix new clippy warnings, addition of type aliases

* Docstrings formatting

* Fixed doctests
2022-09-05 19:08:08 +01:00
guillaume-be
cce1e2707d
Prepare for 0.19 release (#272) 2022-07-25 06:36:02 +01:00
guillaume-be
66d596a2bf
Handle empty generation inputs (#268)
* Handle zero-length slice input for text generation (alias to None)

* Handle zero-length slice input for text generation (return zero-length vector)

* Handle zero-length prompts (seed with BOS token)

* Updated zero-length prompts attention mask to match Python reference

* Fix clippy warnings
2022-07-17 08:53:08 +01:00
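
A sketch of the resulting behavior in #268 from the caller's side (return types per the API of this era; treat the details as assumptions).

```rust
use rust_bert::pipelines::text_generation::TextGenerationModel;

fn main() -> anyhow::Result<()> {
    let model = TextGenerationModel::new(Default::default())?;

    // A zero-length input slice now yields a zero-length output vector.
    let no_prompts: [&str; 0] = [];
    let outputs = model.generate(&no_prompts, None);
    println!("{outputs:?}"); // []

    // An empty prompt string seeds generation with the BOS token instead.
    let outputs = model.generate(&[""], None);
    println!("{outputs:?}");
    Ok(())
}
```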
guillaume-be
76a94d48f0
Update to torch 1.12 (#266)
* Update to torch 1.12

* updated readme
2022-07-09 09:55:57 +01:00
guillaume-be
ed4f4f457d
Updated README Libtorch link to CXX11 ABI (#264) 2022-07-04 17:43:20 +01:00
guillaume-be
a1595e6dfd
Updated sentence embeddings example (#263)
* Added conversion information for Distil-based sentence embedding models

* Fix Clippy warnings
2022-07-03 08:48:31 +01:00
Romain Leroux
4d8a298586
Add sbert implementation for inference (#250)
* Add sbert implementation for inference

* Fix clippy warnings

* Refactor sentence embeddings into a dedicated pipeline

* Add output_attentions and output_hidden_states to T5Config

* Add sbert implementation for inference

* Fix clippy warnings

* Refactor sentence embeddings into a dedicated pipeline

* Add output_attentions and output_hidden_states to T5Config

* Improve sentence_embeddings implementation

* Dedicated tokenizer config for strip_accents and add_prefix_space

* Rename forward to encode_as_tensor

* Remove _conf from Dense layer

* Add sentence embeddings docs

* Addition of remote resources and tests update

* Merge feature branch and fix doctests

* Add SentenceEmbeddingsBuilder<Remote> and improve remote resources

* Use tch::no_grad in sentence embeddings

* Updated changelog, registration of sentence embeddings integration tests

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-06-21 20:24:09 +01:00
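
The dedicated pipeline makes encoding a one-liner after model construction; a usage sketch based on the builder named in the commits above.

```rust
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder, SentenceEmbeddingsModelType,
};

fn main() -> anyhow::Result<()> {
    // `remote` fetches a pretrained model; gradients are disabled internally
    // via tch::no_grad as noted above.
    let model = SentenceEmbeddingsBuilder::remote(SentenceEmbeddingsModelType::AllMiniLmL12V2)
        .create_model()?;
    let sentences = ["This is an example sentence", "Each sentence is converted"];
    let embeddings = model.encode(&sentences)?;
    println!("{} embeddings computed", embeddings.len());
    Ok(())
}
```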
guillaume-be
6b20da41de
0.18.0 release (#257)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* prepare for 0.18.0 release
2022-05-29 12:59:00 +01:00
guillaume-be
6556ee3f13
tch 0.7.2 update (#256)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* - Updated `tch` backend version

* reverted spurious xlnet changes
2022-05-19 19:02:38 +01:00
sftse
3df5ea5d37
Add token offset information to entities (#255)
* Add token offset information to entities

* replace unwrap by error propagation

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-05-18 20:44:50 +01:00
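
A sketch of what the added offsets convey, with simplified stand-in types (the crate's actual `Entity`/`Offset` fields may be named differently).

```rust
/// Char-offset span of an entity in the original input.
#[derive(Debug)]
pub struct Offset {
    pub begin: u32,
    pub end: u32,
}

#[derive(Debug)]
pub struct Entity {
    pub word: String,
    pub score: f64,
    pub label: String,
    pub offset: Offset, // added by #255: locate the entity in the source text
}

fn main() {
    let input = "My name is Amy.";
    let entity = Entity {
        word: "Amy".to_string(),
        score: 0.99,
        label: "I-PER".to_string(),
        offset: Offset { begin: 11, end: 14 },
    };
    // The offsets recover the entity span from the source text.
    println!("{:?} -> {:?}", &input[11..14], entity);
}
```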
Mark Lodato
e5a51b0176
Use resize instead of append to pad features (#254)
* Use `resize` instead of `append` to pad features

**This Commit**
Updates various instances of `append(&mut vec![...])` with
`resize(...)`.

**Why?**
As a micro-optimization. I don't expect this to affect any benchmarks,
since the change is so small compared to the time it takes the model
to do anything, but I ran a small benchmark and this seemed to be the
fastest way to do this because (I think):

- we allocate each attention mask exactly once to the correct capacity
- we don't allocate a new vector to append to the existing one[^1]

And, while this won't speed up anything in practice, I think it might
read more clearly since `resize` tells you the final length so we can
see that all the vectors are the same final length and match `max_len`.

<details>
<summary>Micro benchmark run</summary>

The rust code was roughly:

```rust
pub fn append(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; len];
    v.append(&mut vec![0; max - v.len()]);
    v
}
pub fn resize(len: usize, max: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(max);
    v.resize(len, 1);
    v.resize(max, 0);
    v
}
pub fn halfway(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; len];
    v.resize(max, 0);
    v
}
pub fn overwrite(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; max];
    v[len..max].fill(0);
    v
}
```

and the parameters were roughly:

```rust
    for size in [10, 500, 1000, 5000] {
        for max in [size, size + 1, size * 2, size * 10] {
```

and `resize` was consistently the fastest. `halfway` was similar most of
the time but consistently slightly slower. `overwrite` was slower than
those for reasons I don't understand and `append` was consistently the
slowest (though, of course, the difference was very small when we were
appending zero or one elements).

</details>

[^1]: I can't really read assembly but in [this small godbolt
example][0] I see `__rust_alloc`, `__rust_alloc_zeroed`, and
`do_reserve_and_handle` so I don't think the compiler is seeing the
upcoming allocation and handling it all on the initial allocation.

[0]: https://godbolt.org/z/eTsnjn9Tq

* Padding simplification for sequence generation pipelines

* Move call to `.get_pad_id` outside loop

**Why?**
Because it's the same for every iteration.

See [this comment][0] for more details.

[0]: https://github.com/guillaume-be/rust-bert/pull/254/files#r873138871

* Remove comments on `pad_features`

**Why?**
I tried to add some comments but didn't understand the problem space
well enough to correctly document what the returned masks do.

See [this comment][0] for more details.

[0]: https://github.com/guillaume-be/rust-bert/pull/254/files#r873138314

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-05-16 17:45:05 +01:00
guillaume-be
8c81ab4207
Make mode token aggregation deterministic (#253)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Made mode aggregation deterministic

* revert spurious change

* Moved BERT test to CPU allowing test suite to run with <8GB accelerator buffer

* Update src/pipelines/token_classification.rs

Co-authored-by: Mark Lodato <marklodato0@gmail.com>

* Update src/pipelines/token_classification.rs

Co-authored-by: Mark Lodato <marklodato0@gmail.com>

Co-authored-by: Mark Lodato <marklodato0@gmail.com>
2022-05-12 18:00:38 +01:00
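
The idea behind #253, sketched with a stable tie-break so mode aggregation over token labels is reproducible (a simplified illustration, not the crate's exact implementation).

```rust
use std::collections::HashMap;

/// Picks the most frequent label; ties break toward the lexicographically
/// smallest label so repeated runs give the same answer.
fn deterministic_mode(labels: &[&str]) -> Option<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Highest count wins; on a tie, the smaller label compares greater
        // in this ordering and is therefore selected by `max_by`.
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(label, _)| label.to_string())
}

fn main() {
    let tokens = ["B-PER", "I-PER", "B-PER", "O"];
    assert_eq!(deterministic_mode(&tokens), Some("B-PER".to_string()));
}
```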
Mark Lodato
c5faadcdf0
Use Vec in place of HashMap<usize, T> (#252)
**This Commit**
Attempts to simplify the `predict` function in the
`token_classification` pipeline by substituting a `HashMap` whose keys
are indices into a `Vec`.

**Why?**
Because the `HashMap` eagerly creates token buckets for all indices from
`0..input.len()` we can get the same behavior by using a `Vec`. This
cleans up some later code that was sorting on index because the `Vec`
maintains order by index naturally.

**Note**
I also switched from `get_mut().unwrap()` to `[]` notation because it
was the same but shorter. Happy to revert that if the
`get_mut().unwrap()` is specifically preferred for quickly finding panic
points by grepping for `unwrap` or something!

**Note**
I wrote a benchmark and it didn't seem to make it faster or slower but
hopefully that benchmark will be slightly helpful to those in the future
🤞.
2022-05-10 18:59:55 +01:00
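
An illustration of the substitution described in #252 (not the pipeline's actual code): when the keys are exactly `0..n`, a `Vec` stores the same buckets in index order and removes the sort-by-key step.

```rust
use std::collections::HashMap;

// Before: buckets keyed by input index, created eagerly for 0..n,
// then sorted by key when consumed.
fn bucket_with_map(n: usize) -> HashMap<usize, Vec<f64>> {
    (0..n).map(|i| (i, Vec::new())).collect()
}

// After: a Vec keeps the same data in index order with no sorting step.
fn bucket_with_vec(n: usize) -> Vec<Vec<f64>> {
    vec![Vec::new(); n]
}

fn main() {
    let mut buckets = bucket_with_vec(3);
    buckets[1].push(0.5); // `[]` indexing replaces `get_mut(&1).unwrap()`
    assert_eq!(buckets[1], vec![0.5]);
    let _ = bucket_with_map(3);
}
```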
guillaume-be
b49d853b20
Configuration defaults (#251)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Addition of defaults for Albert and BART configs

* Addition of BERT, DeBERTa, DeBERTaV2 default configs

* Addition of DistilBERT, Electra, FNet, GPT2, GPTNeo, Longformer configs

* Addition of default for all model configurations

* Cleaned-up config aliases and fixed tests

* Fixed Clippy warnings
2022-04-25 17:59:45 +01:00
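
The pattern #251 applies across models, sketched on a trimmed-down config with hypothetical field values shown only to illustrate the idea.

```rust
/// A trimmed-down model config with a handful of fields.
pub struct BertStyleConfig {
    pub hidden_size: i64,
    pub num_attention_heads: i64,
    pub num_hidden_layers: i64,
    pub vocab_size: i64,
}

impl Default for BertStyleConfig {
    /// Mirrors base-model hyper-parameters so a config file is no longer
    /// mandatory to instantiate a model.
    fn default() -> Self {
        Self {
            hidden_size: 768,
            num_attention_heads: 12,
            num_hidden_layers: 12,
            vocab_size: 30522,
        }
    }
}

fn main() {
    let config = BertStyleConfig::default();
    assert_eq!(config.hidden_size, 768);
}
```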
lsb
04b25923ad
Update mobilebert_model.rs (#249)
* Update mobilebert_model.rs

Fix typo in comments

* Update fnet_model.rs

Same typo
2022-04-21 18:39:16 +01:00
Romain Leroux
4162cf1c3c
Fix for ALBERT Attention (#246)
Selects the correct Option<Tensor> for attention in AlbertTransformer.
Besides, all_hidden_states might also be incorrect.

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-04-16 18:40:24 +01:00
guillaume-be
eff7082150
Revert shallow clone (#247)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Fixed Clippy warning
2022-04-15 12:19:42 +01:00
guillaume-be
ba584653bc
Shallow clone optimization (#243)
* Fixed Clippy warnings

* Shallow clone optimization (reduce tensor copy)

* Updated changelog and fixed Clippy warnings
2022-04-10 08:52:37 +01:00
guillaume-be
6f1888e8f9
Merge pull request #242 from guillaume-be/deberta_v2_implementation
Deberta v2 implementation
2022-04-03 10:01:14 +01:00
Guillaume Becquin
dfc96c9b9f Updated tokenizers dependency 2022-04-02 09:22:06 +01:00
Guillaume Becquin
253d51ba83 Added tests, documentation and fixed Clippy warnings 2022-03-27 09:12:32 +01:00
Guillaume Becquin
c2ba629201 Deberta V2 registration in pipelines 2022-03-27 08:40:37 +01:00
Guillaume Becquin
c7eea6f7ad Addition of Deberta V2 task-specific heads 2022-03-26 16:49:20 +00:00
Guillaume Becquin
3963d3a59f Addition of DebertaV2Model and DebertaV2ForMaskedLM 2022-03-26 09:22:43 +00:00
Guillaume Becquin
61f12ae671 Implementation of DebertaV2Conv 2022-03-22 20:53:07 +00:00
Guillaume Becquin
eff522e363 Implementation of DebertaV2Encoder 2022-03-20 09:31:42 +00:00
Guillaume Becquin
ad77d4140c Merge remote-tracking branch 'origin/master' into deberta_v2_implementation
# Conflicts:
#	Cargo.toml
2022-03-20 08:34:16 +00:00
guillaume-be
641162871a
Merge pull request #238 from guillaume-be/individual_token_scores
Return individual token scores
2022-03-20 08:04:35 +00:00