Commit Graph

1061 Commits

guillaume-be
84561ec82b
Tokenizer special token map update (#330)
* Updates for compatibility with tokenizers special token rework

* Updated mask pipeline methods

* Bumped version

* Fix clippy warnings
2023-01-30 17:53:18 +00:00
Åke Amcoff
80e0197e2c
Generalize input of NERModel::predict_full_entities (#329) 2023-01-23 20:24:18 +00:00
dependabot[bot]
e1e8fc615d
Bump torch from 1.13.0 to 1.13.1 (#328)
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
- [Commits](https://github.com/pytorch/pytorch/compare/v1.13.0...v1.13.1)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-22 09:25:33 +00:00
guillaume-be
f1b8409ca8
0.20.0 release (#327)
* Relax dependencies, fix doctests

* Bump tokio for future rust compat, fix tests

* Updated changelog
2023-01-22 09:00:14 +00:00
guillaume-be
0fc5ce6ad4
CodeBERT Pretrained models and examples (#322)
* Addition of Codebert examples

* Addition of CodeBERT pretrained models, CodeBERT example
2023-01-20 19:02:33 +00:00
guillaume-be
f12e8ef475
Aligned ModelForTokenClassification and ModelForSequenceClassification APIs (#323) 2023-01-15 11:10:38 +00:00
Andreas Haufler
445b76fe7b
Permits TokenClassificationOption / DistilBertForTokenClassification to fail gracefully for an invalid configuration. (#320)
* Properly handle config errors when creating classification models

Instead of panicking, we now return a proper RustBertError so that
an invalid model or config does not crash the whole system.

* Properly formats newly added code.

* Fixes an example within the documentation

* Properly unwraps the newly created results in unit tests.

* Fixes some code formatting issues.

* Uses proper/idiomatic error handling in unit tests.

* Moves the "?" to the correct position.
2023-01-15 10:32:57 +00:00
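
A minimal sketch of the error-handling pattern described in #320 above, using simplified stand-in types rather than the crate's actual model structs: construction returns a `RustBertError` for an invalid configuration instead of panicking.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; the crate's real types differ.
#[derive(Debug)]
pub enum RustBertError {
    InvalidConfigurationError(String),
}

pub struct ModelConfig {
    pub id2label: Option<HashMap<i64, String>>,
}

pub struct TokenClassificationModel {
    labels: HashMap<i64, String>,
}

impl TokenClassificationModel {
    /// Returns an error instead of panicking when the config lacks a label map.
    pub fn new(config: ModelConfig) -> Result<Self, RustBertError> {
        let labels = config.id2label.ok_or_else(|| {
            RustBertError::InvalidConfigurationError(
                "id2label must be provided for token classification".to_string(),
            )
        })?;
        Ok(Self { labels })
    }
}

fn main() {
    let bad_config = ModelConfig { id2label: None };
    // The caller decides how to handle the failure; nothing panics here.
    match TokenClassificationModel::new(bad_config) {
        Ok(_) => println!("model ready"),
        Err(e) => eprintln!("invalid configuration: {e:?}"),
    }
}
```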
guillaume-be
2c4a79524d
Update README.md (#318) 2023-01-05 15:37:31 +00:00
guillaume-be
3bf86331ca
- Updated cached-path dependency (#310) 2022-12-22 08:51:43 +00:00
guillaume-be
fdf5503163
Fixed Clippy warnings (#309)
* - Fixed Clippy warnings
- Updated `tch` dependency
- Updated README to avoid confusion with respect to the required `LIBTORCH` version for the repository and published package versions

* Fixed Clippy warnings (2)

* Fixed Clippy warnings (3)
2022-12-21 17:52:26 +00:00
Vincent Xiao
dae899fea6
Add pipelines::masked_language and codebert support (#282)
* add support for loading local model in SequenceClassificationConfig

* adjust config to match the SequenceClassificationConfig

* add pipelines::masked_language

* add support and example for codebert

* provide an optional mask_token String field for the masked_language pipeline

* update example for masked_language pipeline

* revert CodeBERT support

* revert support for loading local model

* resolve conflicts

* update MaskedLanguageConfig

* fix doctest error in zero_shot_classification.rs

* MaskedLM pipeline updates

* fix multiple masked tokens, added test

* Updated changelog and docs

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-12-21 16:58:02 +00:00
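
A rough usage sketch for the masked-language pipeline added above; the availability of a `Default` implementation for `MaskedLanguageConfig` and the exact `predict` signature are assumptions here and should be checked against the crate documentation.

```rust
use rust_bert::pipelines::masked_language::{MaskedLanguageConfig, MaskedLanguageModel};

fn main() -> anyhow::Result<()> {
    // A `Default` config is assumed for brevity; the PR adds an optional
    // `mask_token` field on the config for models with custom mask tokens.
    let config = MaskedLanguageConfig::default();
    let model = MaskedLanguageModel::new(config)?;

    // Multiple masked positions per input are supported (see the fix above).
    let input = ["Paris is the [MASK] of [MASK]."];
    let output = model.predict(input)?;
    println!("{output:?}");
    Ok(())
}
```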
Anna Melnikov
a34cf9f8e4
Make predict methods in ZeroShot pipeline return Result instead of panicking on unwrap (#301)
* Add checked prediction methods

- Add checked prediction methods to ZeroShotClassificationModel.
These methods return Option and convert any underlying errors into None,
to allow callers to implement appropriate error handling logic.

* Update ZeroShot example to use checked method.

* Add tests for ZeroShot checked methods

* Change checked prediction methods to return Result

* refactor: rename *_checked into try_*

Rename *_checked methods into try_* methods.
This is more idiomatic vis-a-vis the Rust standard library.

* refactor: remove try_ prefix from predict methods

* refactor: change return from Option to Result

Change return type of ZeroShotClassificationModel.prepare_for_model
from Option to Result. This simplifies the code and returns
the error closer to its origin.

This addresses comments from @guillaume-be.

* refactor: address clippy lints in tests

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-12-04 09:10:01 +00:00
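
After #301, callers handle prediction failures themselves instead of hitting an internal unwrap. A sketch of the resulting call site; the trailing template/max_length arguments follow the crate docs, and the exact signature should be treated as an assumption.

```rust
use rust_bert::pipelines::zero_shot_classification::ZeroShotClassificationModel;

fn main() -> anyhow::Result<()> {
    let model = ZeroShotClassificationModel::new(Default::default())?;
    let input = ["Who are you voting for in 2020?"];
    let labels = ["politics", "public health", "economics", "sports"];

    // `predict` now returns a `Result`, so `?` propagates any failure
    // instead of the pipeline panicking internally.
    let output = model.predict(&input, &labels, None, 128)?;
    println!("{output:?}");
    Ok(())
}
```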
guillaume-be
a0ef06bccf
revert to windows 2019 pending 2022 fix (#304) 2022-12-03 11:31:22 +00:00
guillaume-be
a93b406334
Update CI to build without defaults (#299)
* Avoid attention copy, fix remote feature

* revert conversation change

* Updated CI for build without defaults
2022-11-20 09:39:00 +00:00
guillaume-be
05367b4df2
Make max_length optional (#296)
* Made `max_length` an optional argument for generation methods and pipelines

* Updated changelog
2022-11-15 19:20:51 +00:00
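
A small sketch of #296 from the caller's side, assuming the text-generation pipeline config: with `max_length: None`, no fixed cap is imposed on the generated length.

```rust
use rust_bert::pipelines::text_generation::{TextGenerationConfig, TextGenerationModel};

fn main() -> anyhow::Result<()> {
    // `max_length` is now an Option: `None` removes the mandatory cap.
    let config = TextGenerationConfig {
        max_length: None,
        ..Default::default()
    };
    let model = TextGenerationModel::new(config)?;
    let output = model.generate(&["The cat sat on"], None);
    println!("{output:?}");
    Ok(())
}
```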
guillaume-be
5d2b107e99
Keyword/Keyphrase extraction (#295)
* stop word tokenizer implementation

* - Addition of all-mini-lm-l6-v2

* initial implementation of keyword scorer

* Cosine Similarity keyword extraction

* Added lower case parsing from tokenizer config for sentence embeddings

* Initial draft of pipeline complete

* Addition of Maximal Marginal Relevance scorer

* Addition of Max Sum scorer

* Lowercase and ngrams handling

* Improved n-gram handling

* Skip n-grams containing stopwords

* Fixed short sentence input and added documentation

* Updated documentation and defaults, added example

* Addition of tests for keywords extractions

* Updated changelog

* Fixed Clippy warnings
2022-11-13 08:51:10 +00:00
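
A hedged usage sketch for the keyword extraction pipeline introduced in #295; the type and method names below follow the pattern of the other pipelines and should be checked against the published docs.

```rust
use rust_bert::pipelines::keywords_extraction::KeywordExtractionModel;

fn main() -> anyhow::Result<()> {
    // Defaults are assumed to use the all-mini-lm-l6-v2 sentence embeddings;
    // the scorer (cosine similarity, Maximal Marginal Relevance or Max Sum)
    // is selected on the config struct.
    let model = KeywordExtractionModel::new(Default::default())?;
    let input = "Rust is a multi-paradigm, general-purpose programming language \
                 that emphasizes performance, type safety, and concurrency.";
    let keywords = model.predict(&[input])?;
    println!("{keywords:?}");
    Ok(())
}
```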
guillaume-be
2ffb6005f3
- Addition of all-mini-lm-l6-v2 (#294) 2022-11-09 17:51:31 +00:00
guillaume-be
c6771d3992
Update to tch=0.9.0 (#293)
* Fixed short sentence input and added documentation

* Fixed Clippy warnings

* Updated CI Python version

* cleaner dim specification
2022-11-07 17:45:52 +00:00
guillaume-be
340be36ed9
Mixed resources (#291)
* - made `merges` resource optional for all pipelines
- allow mixing local and remote resources for pipelines

* Updated changelog

* Fixed Clippy warnings
2022-10-30 07:39:52 +00:00
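
A sketch of what #291 enables: pointing a pipeline at a local weights file (hypothetical path) while keeping the remaining resources remote, with `merges` now optional for model types that do not need it.

```rust
use std::path::PathBuf;

use rust_bert::gpt2::{Gpt2ConfigResources, Gpt2MergesResources, Gpt2VocabResources};
use rust_bert::resources::{LocalResource, RemoteResource};

fn main() {
    // Hypothetical local weights mixed with remote auxiliary resources.
    let model_resource = LocalResource {
        local_path: PathBuf::from("path/to/rust_model.ot"),
    };
    let config_resource = RemoteResource::from_pretrained(Gpt2ConfigResources::GPT2);
    let vocab_resource = RemoteResource::from_pretrained(Gpt2VocabResources::GPT2);
    // `merges` is now optional; pass `None` for models that do not use it.
    let merges_resource = Some(RemoteResource::from_pretrained(Gpt2MergesResources::GPT2));

    // Any pipeline configuration can accept this mix of providers.
    let _ = (model_resource, config_resource, vocab_resource, merges_resource);
}
```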
guillaume-be
78da0f4814
Added example for 2-turn conversation in docs (#287) 2022-10-08 09:31:20 +01:00
guillaume-be
c9f7e8653d
- Fixed 1.64 Clippy warnings (#284) 2022-10-08 08:38:10 +01:00
guillaume-be
59c0e668d6
- Fixed RoBERTa confic checks for sentence classification (#281) 2022-09-10 08:50:25 +01:00
guillaume-be
e4a2a102e0
Rust 1.63 update (#280)
* Fix new clippy warnings, addition of type aliases

* Docstrings formatting

* Fixed doctests
2022-09-05 19:08:08 +01:00
guillaume-be
cce1e2707d
Prepare for 0.19 release (#272) 2022-07-25 06:36:02 +01:00
guillaume-be
66d596a2bf
Handle empty generation inputs (#268)
* Handle zero-length slice input for text generation (alias to None)

* Handle zero-length slice input for text generation (return zero-length vector)

* Handle zero-length prompts (seed with BOS token)

* Updated zero-length prompts attention mask to match Python reference

* Fix clippy warnings
2022-07-17 08:53:08 +01:00
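
A sketch of the resulting behavior in #268 from the caller's side (return types per the API of this era; treat the details as assumptions).

```rust
use rust_bert::pipelines::text_generation::TextGenerationModel;

fn main() -> anyhow::Result<()> {
    let model = TextGenerationModel::new(Default::default())?;

    // A zero-length input slice now yields a zero-length output vector.
    let no_prompts: [&str; 0] = [];
    let outputs = model.generate(&no_prompts, None);
    println!("{outputs:?}"); // []

    // An empty prompt string seeds generation with the BOS token instead.
    let outputs = model.generate(&[""], None);
    println!("{outputs:?}");
    Ok(())
}
```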
guillaume-be
76a94d48f0
Update to torch 1.12 (#266)
* Update to torch 1.12

* updated readme
2022-07-09 09:55:57 +01:00
guillaume-be
ed4f4f457d
Updated README Libtorch link to CXX11 ABI (#264) 2022-07-04 17:43:20 +01:00
guillaume-be
a1595e6dfd
Updated sentence embeddings example (#263)
* Added conversion information for Distil-based sentence embedding models

* Fix Clippy warnings
2022-07-03 08:48:31 +01:00
Romain Leroux
4d8a298586
Add sbert implementation for inference (#250)
* Add sbert implementation for inference

* Fix clippy warnings

* Refactor sentence embeddings into a dedicated pipeline

* Add output_attentions and output_hidden_states to T5Config

* Add sbert implementation for inference

* Fix clippy warnings

* Refactor sentence embeddings into a dedicated pipeline

* Add output_attentions and output_hidden_states to T5Config

* Improve sentence_embeddings implementation

* Dedicated tokenizer config for strip_accents and add_prefix_space

* Rename forward to encode_as_tensor

* Remove _conf from Dense layer

* Add sentence embeddings docs

* Addition of remote resources and tests update

* Merge feature branch and fix doctests

* Add SentenceEmbeddingsBuilder<Remote> and improve remote resources

* Use tch::no_grad in sentence embeddings

* Updated changelog, registration of sentence embeddings integration tests

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-06-21 20:24:09 +01:00
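
The dedicated pipeline makes encoding a one-liner after model construction; a usage sketch based on the builder named in the commits above.

```rust
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder, SentenceEmbeddingsModelType,
};

fn main() -> anyhow::Result<()> {
    // `remote` fetches a pretrained model; gradients are disabled internally
    // via tch::no_grad as noted above.
    let model = SentenceEmbeddingsBuilder::remote(SentenceEmbeddingsModelType::AllMiniLmL12V2)
        .create_model()?;
    let sentences = ["This is an example sentence", "Each sentence is converted"];
    let embeddings = model.encode(&sentences)?;
    println!("{} embeddings computed", embeddings.len());
    Ok(())
}
```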
guillaume-be
6b20da41de
0.18.0 release (#257)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* prepare for 0.18.0 release
2022-05-29 12:59:00 +01:00
guillaume-be
6556ee3f13
tch 0.7.2 update (#256)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* - Updated `tch` backend version

* reverted spurious xlnet changes
2022-05-19 19:02:38 +01:00
sftse
3df5ea5d37
Add token offset information to entities (#255)
* Add token offset information to entities

* replace unwrap by error propagation

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-05-18 20:44:50 +01:00
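
A sketch of what the added offsets convey, with simplified stand-in types (the crate's actual `Entity`/`Offset` fields may be named differently).

```rust
/// Char-offset span of an entity in the original input.
#[derive(Debug)]
pub struct Offset {
    pub begin: u32,
    pub end: u32,
}

#[derive(Debug)]
pub struct Entity {
    pub word: String,
    pub score: f64,
    pub label: String,
    pub offset: Offset, // added by #255: locate the entity in the source text
}

fn main() {
    let input = "My name is Amy.";
    let entity = Entity {
        word: "Amy".to_string(),
        score: 0.99,
        label: "I-PER".to_string(),
        offset: Offset { begin: 11, end: 14 },
    };
    // The offsets recover the entity span from the source text.
    println!("{:?} -> {:?}", &input[11..14], entity);
}
```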
Mark Lodato
e5a51b0176
Use resize instead of append to pad features (#254)
* Use `resize` instead of `append` to pad features

**This Commit**
Updates various instances of `append(&mut vec![...])` with
`resize(...)`.

**Why?**
As a micro-optimization. I don't expect this to affect any benchmarks,
since the change is so small compared to the time it takes the model
to do anything, but I ran a small benchmark and this seemed to be the
fastest way to do this because (I think):

- we allocate each attention mask exactly once to the correct capacity
- we don't allocate a new vector to append to the existing one[^1]

And, while this won't speed up anything in practice, I think it might
read more clearly since `resize` tells you the final length so we can
see that all the vectors are the same final length and match `max_len`.

<details>
<summary>Micro benchmark run</summary>

The rust code was roughly:

```rust
pub fn append(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; len];
    v.append(&mut vec![0; max - v.len()]);
    v
}
pub fn resize(len: usize, max: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(max);
    v.resize(len, 1);
    v.resize(max, 0);
    v
}
pub fn halfway(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; len];
    v.resize(max, 0);
    v
}
pub fn overwrite(len: usize, max: usize) -> Vec<usize> {
    let mut v = vec![1; max];
    v[len..max].fill(0);
    v
}
```

and the parameters were roughly:

```rust
    for size in [10, 500, 1000, 5000] {
        for max in [size, size + 1, size * 2, size * 10] {
```

and `resize` was consistently the fastest. `halfway` was similar most of
the time but consistently slightly slower. `overwrite` was slower than
those for reasons I don't understand and `append` was consistently the
slowest (though, of course, the difference was very small when we were
appending zero or one elements).

</details>

[^1]: I can't really read assembly but in [this small godbolt
example][0] I see `__rust_alloc`, `__rust_alloc_zeroed`, and
`do_reserve_and_handle` so I don't think the compiler is seeing the
upcoming allocation and handling it all on the initial allocation.

[0]: https://godbolt.org/z/eTsnjn9Tq

* Padding simplification for sequence generation pipelines

* Move call to `.get_pad_id` outside loop

**Why?**
Because it's the same for every iteration.

See [this comment][0] for more details.

[0]: https://github.com/guillaume-be/rust-bert/pull/254/files#r873138871

* Remove comments on `pad_features`

**Why?**
I tried to add some comments but didn't understand the problem space
well enough to correctly document what the returned masks do.

See [this comment][0] for more details.

[0]: https://github.com/guillaume-be/rust-bert/pull/254/files#r873138314

Co-authored-by: Guillaume Becquin <guillaume.becquin@gmail.com>
2022-05-16 17:45:05 +01:00
guillaume-be
8c81ab4207
Make mode token aggregation deterministic (#253)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Made mode aggregation deterministic

* revert spurious change

* Moved BERT test to CPU allowing test suite to run with <8GB accelerator buffer

* Update src/pipelines/token_classification.rs

Co-authored-by: Mark Lodato <marklodato0@gmail.com>

* Update src/pipelines/token_classification.rs

Co-authored-by: Mark Lodato <marklodato0@gmail.com>

Co-authored-by: Mark Lodato <marklodato0@gmail.com>
2022-05-12 18:00:38 +01:00
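
The idea behind #253, sketched with a stable tie-break so mode aggregation over token labels is reproducible (a simplified illustration, not the crate's exact implementation).

```rust
use std::collections::HashMap;

/// Picks the most frequent label; ties break toward the lexicographically
/// smallest label so repeated runs give the same answer.
fn deterministic_mode(labels: &[&str]) -> Option<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Highest count wins; on a tie, the smaller label compares greater
        // in this ordering and is therefore selected by `max_by`.
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(label, _)| label.to_string())
}

fn main() {
    let tokens = ["B-PER", "I-PER", "B-PER", "O"];
    assert_eq!(deterministic_mode(&tokens), Some("B-PER".to_string()));
}
```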
Mark Lodato
c5faadcdf0
Use Vec in place of HashMap<usize, T> (#252)
**This Commit**
Attempts to simplify the `predict` function in the
`token_classification` pipeline by substituting a `HashMap` whose keys
are indices into a `Vec`.

**Why?**
Because the `HashMap` eagerly creates token buckets for all indices from
`0..input.len()` we can get the same behavior by using a `Vec`. This
cleans up some later code that was sorting on index because the `Vec`
maintains order by index naturally.

**Note**
I also switched from `get_mut().unwrap()` to `[]` notation because it
was the same but shorter. Happy to revert that if the
`get_mut().unwrap()` is specifically preferred for quickly finding panic
points by grepping for `unwrap` or something!

**Note**
I wrote a benchmark and it didn't seem to make it faster or slower but
hopefully that benchmark will be slightly helpful to those in the future
🤞.
2022-05-10 18:59:55 +01:00
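
An illustration of the substitution described in #252 (not the pipeline's actual code): when the keys are exactly `0..n`, a `Vec` stores the same buckets in index order and removes the sort-by-key step.

```rust
use std::collections::HashMap;

// Before: buckets keyed by input index, created eagerly for 0..n,
// then sorted by key when consumed.
fn bucket_with_map(n: usize) -> HashMap<usize, Vec<f64>> {
    (0..n).map(|i| (i, Vec::new())).collect()
}

// After: a Vec keeps the same data in index order with no sorting step.
fn bucket_with_vec(n: usize) -> Vec<Vec<f64>> {
    vec![Vec::new(); n]
}

fn main() {
    let mut buckets = bucket_with_vec(3);
    buckets[1].push(0.5); // `[]` indexing replaces `get_mut(&1).unwrap()`
    assert_eq!(buckets[1], vec![0.5]);
    let _ = bucket_with_map(3);
}
```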
guillaume-be
b49d853b20
Configuration defaults (#251)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Addition of defaults for Albert and BART configs

* Addition of BERT, DeBERTa, DeBERTaV2 default configs

* Addition of DistilBERT, Electra, FNet, GPT2, GPTNeo, Longformer configs

* Addition of default for all model configurations

* Cleaned-up config aliases and fixed tests

* Fixed Clippy warnings
2022-04-25 17:59:45 +01:00
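
The pattern #251 applies across models, sketched on a trimmed-down config with hypothetical field values shown only to illustrate the idea.

```rust
/// A trimmed-down model config with a handful of fields.
pub struct BertStyleConfig {
    pub hidden_size: i64,
    pub num_attention_heads: i64,
    pub num_hidden_layers: i64,
    pub vocab_size: i64,
}

impl Default for BertStyleConfig {
    /// Mirrors base-model hyper-parameters so a config file is no longer
    /// mandatory to instantiate a model.
    fn default() -> Self {
        Self {
            hidden_size: 768,
            num_attention_heads: 12,
            num_hidden_layers: 12,
            vocab_size: 30522,
        }
    }
}

fn main() {
    let config = BertStyleConfig::default();
    assert_eq!(config.hidden_size, 768);
}
```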
lsb
04b25923ad
Update mobilebert_model.rs (#249)
* Update mobilebert_model.rs

Fix typo in comments

* Update fnet_model.rs

Same typo
2022-04-21 18:39:16 +01:00
Romain Leroux
4162cf1c3c
Fix for ALBERT Attention (#246)
Selects the correct Option<Tensor> for attention in AlbertTransformer.
Besides, all_hidden_states might also be incorrect.

Co-authored-by: guillaume-be <guillaume.becquin@gmail.com>
2022-04-16 18:40:24 +01:00
guillaume-be
eff7082150
Revert shallow clone (#247)
* Fixed Clippy warnings

* Revert "Shallow clone optimization (#243)"

This reverts commit ba584653bc.

* Fixed Clippy warning
2022-04-15 12:19:42 +01:00
guillaume-be
ba584653bc
Shallow clone optimization (#243)
* Fixed Clippy warnings

* Shallow clone optimization (reduce tensor copy)

* Updated changelog and fixed Clippy warnings
2022-04-10 08:52:37 +01:00
guillaume-be
6f1888e8f9
Merge pull request #242 from guillaume-be/deberta_v2_implementation
Deberta v2 implementation
2022-04-03 10:01:14 +01:00
Guillaume Becquin
dfc96c9b9f Updated tokenizers dependency 2022-04-02 09:22:06 +01:00
Guillaume Becquin
253d51ba83 Added tests, documentation and fixed Clippy warnings 2022-03-27 09:12:32 +01:00
Guillaume Becquin
c2ba629201 Deberta V2 registration in pipelines 2022-03-27 08:40:37 +01:00
Guillaume Becquin
c7eea6f7ad Addition of Deberta V2 task-specific heads 2022-03-26 16:49:20 +00:00
Guillaume Becquin
3963d3a59f Addition of DebertaV2Model and DebertaV2ForMaskedLM 2022-03-26 09:22:43 +00:00
Guillaume Becquin
61f12ae671 Implementation of DebertaV2Conv 2022-03-22 20:53:07 +00:00
Guillaume Becquin
eff522e363 Implementation of DebertaV2Encoder 2022-03-20 09:31:42 +00:00
Guillaume Becquin
ad77d4140c Merge remote-tracking branch 'origin/master' into deberta_v2_implementation
# Conflicts:
#	Cargo.toml
2022-03-20 08:34:16 +00:00
guillaume-be
641162871a
Merge pull request #238 from guillaume-be/individual_token_scores
Return individual token scores
2022-03-20 08:04:35 +00:00