mirror of https://github.com/facebookresearch/fairseq.git synced 2024-11-12 21:52:01 +03:00

History

Louis Martin 4948d890a4 Fix link in CamemBERT readme (#2722)

Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fix link in CamemBERT readme

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2722

Reviewed By: louismartin

Differential Revision: D24307327

Pulled By: myleott

fbshipit-source-id: c3c29a19de06a8062fa7f7212ad6df0d549ad25f

2020-10-14 09:47:26 -07:00

README.md	Fix link in CamemBERT readme (#2722)	2020-10-14 09:47:26 -07:00

README.md

CamemBERT: a Tasty French Language Model

Introduction

CamemBERT is a pretrained French language model based on RoBERTa and trained on 138 GB of French text.

It is also available in github.com/huggingface/transformers.

Pre-trained models

Model	#params	Download	Arch.	Training data
`camembert` / `camembert-base`	110M	camembert-base.tar.gz	Base	OSCAR (138 GB of text)
`camembert-large`	335M	camembert-large.tar.gz	Large	CCNet (135 GB of text)
`camembert-base-ccnet`	110M	camembert-base-ccnet.tar.gz	Base	CCNet (135 GB of text)
`camembert-base-wikipedia-4gb`	110M	camembert-base-wikipedia-4gb.tar.gz	Base	Wikipedia (4 GB of text)
`camembert-base-oscar-4gb`	110M	camembert-base-oscar-4gb.tar.gz	Base	Subsample of OSCAR (4 GB of text)
`camembert-base-ccnet-4gb`	110M	camembert-base-ccnet-4gb.tar.gz	Base	Subsample of CCNet (4 GB of text)

Example usage

fairseq

Load CamemBERT from torch.hub (PyTorch >= 1.1):

import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert')
camembert.eval()  # disable dropout (or leave in train mode to finetune)

Load CamemBERT (for PyTorch 1.0 or custom models):

# Download and extract the CamemBERT model (shell)
wget https://dl.fbaipublicfiles.com/fairseq/models/camembert-base.tar.gz
tar -xzvf camembert-base.tar.gz

# Load the model in fairseq (Python)
from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('/path/to/camembert')
camembert.eval()  # disable dropout (or leave in train mode to finetune)
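
The download step above only shows the URL for camembert-base. As a minimal sketch, assuming the other archives listed in the table are hosted under the same base URL (this README does not confirm that), a small helper could build the download URL for any listed model. The names `BASE_URL`, `MODEL_NAMES` and `archive_url` are hypothetical and not part of fairseq.

```python
# Assumption: every archive in the table lives under the same base URL as
# camembert-base.tar.gz. Only that one URL actually appears in this README.
BASE_URL = "https://dl.fbaipublicfiles.com/fairseq/models"

# Archive names taken from the "Pre-trained models" table.
MODEL_NAMES = [
    "camembert-base",
    "camembert-large",
    "camembert-base-ccnet",
    "camembert-base-wikipedia-4gb",
    "camembert-base-oscar-4gb",
    "camembert-base-ccnet-4gb",
]


def archive_url(model_name: str) -> str:
    """Return the assumed download URL for a listed model archive."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name!r}")
    return f"{BASE_URL}/{model_name}.tar.gz"


if __name__ == "__main__":
    for name in MODEL_NAMES:
        print(archive_url(name))
```

Rejecting names that are not in the table keeps a typo from turning into a request for a file that may not exist.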

Filling masks:

masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
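
The output shown above is a list of (filled sentence, score, predicted token) tuples. The sketch below shows one way to pick the highest-scoring candidate. It uses the sample output copied from this README rather than a loaded model, so it runs without fairseq, and the helper name `best_prediction` is hypothetical.

```python
# Sample output copied from the fill_mask example in this README. With a loaded
# model you would use: predictions = camembert.fill_mask(masked_line, topk=3)
predictions = [
    ('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
    ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
    ('Le camembert est succulent :)', 0.03453322499990463, ' succulent'),
]


def best_prediction(candidates):
    """Return the (sentence, score, token) tuple with the highest score."""
    return max(candidates, key=lambda candidate: candidate[1])


sentence, score, token = best_prediction(predictions)
# Predicted tokens carry a leading space, so strip it before display.
print(f"{sentence} (token={token.strip()!r}, score={score:.3f})")
```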

Extract features from CamemBERT:

# Extract the last layer's features
line = "J'aime le camembert !"
tokens = camembert.encode(line)
last_layer_features = camembert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 10, 768])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
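
The numbers in the assertions above (13 hidden states, feature width 768) can be derived from the architecture. The sketch below assumes the standard RoBERTa sizes, 12 layers with hidden size 768 for Base and 24 layers with hidden size 1024 for Large; these values are not stated in this README. The names `ARCHS` and `expected_shapes` are hypothetical.

```python
# Assumed architecture sizes (standard RoBERTa Base/Large values; not stated
# in this README).
ARCHS = {
    "Base": {"layers": 12, "hidden": 768},
    "Large": {"layers": 24, "hidden": 1024},
}


def expected_shapes(arch: str, num_tokens: int, batch_size: int = 1):
    """Return (number of hidden states, shape of the last-layer features)."""
    config = ARCHS[arch]
    # One hidden state per transformer layer, plus the embedding layer (layer 0).
    num_hidden_states = config["layers"] + 1
    feature_shape = (batch_size, num_tokens, config["hidden"])
    return num_hidden_states, feature_shape


# The example sentence above is encoded into 10 tokens.
print(expected_shapes("Base", num_tokens=10))
```

Under these assumptions, a Large model would return 25 hidden states, so the `len(all_layers) == 13` assertion applies only to Base checkpoints.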

Citation

If you use our work, please cite:

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}