# GottBERT: a pure German language model

## Introduction

GottBERT is a German language model based on RoBERTa, pretrained on 145GB of German text.

## Example usage

### fairseq

#### Load GottBERT from torch.hub (PyTorch >= 1.1):
```python
import torch
gottbert = torch.hub.load('pytorch/fairseq', 'gottbert-base')
gottbert.eval()  # disable dropout (or leave in train mode to finetune)
```
#### Load GottBERT (for PyTorch 1.0 or custom models):
```bash
# Download gottbert model
wget https://dl.gottbert.de/fairseq/models/gottbert-base.tar.gz
tar -xzvf gottbert-base.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import GottbertModel
gottbert = GottbertModel.from_pretrained('/path/to/gottbert')
gottbert.eval()  # disable dropout (or leave in train mode to finetune)
```
#### Filling masks:
```python
masked_line = 'Gott ist <mask> ! :)'
gottbert.fill_mask(masked_line, topk=3)
# [('Gott ist gut ! :)',        0.3642110526561737,   ' gut'),
#  ('Gott ist überall ! :)',    0.06009674072265625,  ' überall'),
#  ('Gott ist großartig ! :)',  0.0370681993663311,   ' großartig')]
```
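Each tuple returned by `fill_mask` pairs a filled sentence with its probability and the predicted token. As a minimal sketch of post-processing (using the scores from the example above as fixed data; the 0.05 threshold is an arbitrary illustration, not part of the fairseq API), the predictions can be filtered and ranked in plain Python:

```python
# fill_mask output from the example above, reused as plain data:
# (filled_sentence, probability, token)
predictions = [
    ('Gott ist gut ! :)',        0.3642110526561737,  ' gut'),
    ('Gott ist überall ! :)',    0.06009674072265625, ' überall'),
    ('Gott ist großartig ! :)',  0.0370681993663311,  ' großartig'),
]

# Keep only fills above an (illustrative) probability threshold
threshold = 0.05
confident = [(sentence, prob) for sentence, prob, _ in predictions
             if prob >= threshold]

# Pick the most probable remaining fill
best_sentence, best_prob = max(confident, key=lambda pair: pair[1])
assert best_sentence == 'Gott ist gut ! :)'
```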
#### Extract features from GottBERT:
```python
# Extract the last layer's features
line = "Der erste Schluck aus dem Becher der Naturwissenschaft macht atheistisch , aber auf dem Grunde des Bechers wartet Gott !"
tokens = gottbert.encode(line)
last_layer_features = gottbert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 27, 768])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = gottbert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
```
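The per-token features can be pooled into a single sentence vector, e.g. for similarity comparisons. This is an illustrative sketch, not part of the fairseq API: `mean_pool` and `sentence_similarity` are hypothetical helpers, the commented lines show how they would apply to real `extract_features` output, and the demonstration below runs on random tensors of GottBERT's output shape:

```python
import torch

def mean_pool(features):
    # Average per-token features into one sentence vector:
    # [batch, seq_len, hidden] -> [batch, hidden]
    return features.mean(dim=1)

def sentence_similarity(feats_a, feats_b):
    # Cosine similarity between two mean-pooled sentence vectors
    return torch.nn.functional.cosine_similarity(
        mean_pool(feats_a), mean_pool(feats_b), dim=-1
    )

# With a loaded model (see above), this would be e.g.:
# feats_a = gottbert.extract_features(gottbert.encode('Guten Morgen !'))
# feats_b = gottbert.extract_features(gottbert.encode('Guten Abend !'))

# Demonstration with random tensors of GottBERT's output shape:
feats_a = torch.randn(1, 27, 768)
feats_b = torch.randn(1, 27, 768)
assert mean_pool(feats_a).size() == torch.Size([1, 768])
sim = sentence_similarity(feats_a, feats_b)
assert sim.size() == torch.Size([1])
assert -1.0 <= sim.item() <= 1.0
```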

## Citation

If you use our work, please cite:

```bibtex
@misc{scheible2020gottbert,
      title={GottBERT: a pure German Language Model},
      author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
      year={2020},
      eprint={2012.02110},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```